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1.  Speech  Analysis  by  Linear  Prediction 

Up  to  the  present,  most  of  the  effort  has  been  devoted  to  the 
evelopment  of  an  interactive  speech  analysis  system,  which  is 
implement  on  the  Fast  Digital  Processor.  The  analysis  procedure 

sch  7  tGChniqUe  °£  U"ear  P^aictive  analysis.  l„  this 

radiati  !  ""‘T*  *“■"*  °£  9l°ttal  Source'  vocal  tract,  and 
this  represented  by  a  single  all-pole  filter.  In 

way,  spectral  frames  are  constrained  to  have  a  fixed  numhe 
resonant  peaks,  which  are  located  at  the  positions  of  the  formats 
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retain  the  important  information  about  the  formant-  The  .  , 
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samples ,  and  the  convenient  storage  of  the  computed  spectral  frames. 
It  is  possible  to  mark  and  edit  the  data  in  flexible  ways,  and  to 
compute  and  attach  auxiliary  calculations  to  the  basic  data.  Any 
of  these  computed  parameters  can  be  easily  displayed  and  examined. 

The  system  will  be  used  to  createa  very  largo  data  base  of 
processed  utterances  for  use  in  speech  recognition  research,  which 
is  the  next  phase  of  this  project.  Disyllabic  utterances  of  the 
form  /haC!VC2/  will  be  recorded  for  all  possible  c1#  C2  combina¬ 
tions,  and  for  a  wide  range  of  the  vowels  V.  The  goal  will  then 
be  to  describe  the  acoustic  phonetic  parameters  of  C1  in  these 
utterances.  While  there  are  many  known  contextual  effects  on  the 
phonetic  realization  of  sounds,  it  is  felt  that  these  are  mini¬ 
mized  for  consonants  in  pro-stressed  position.  Furthermore, 
stressed  syllable  nuclei  are  reliable  anchor  points  at  which  to 
initiate  phonemic  recognition.  For  these  reasons,  recognition  of 
pre-stressed  consonants  can  be  expected  to  be  at  least  as  reliable 
as  that  of  consonants  in  other  positions,  and  hence  a  substantial 
effort  toward  their  recognition  is  justified. 

We  expect  that  this  study  will  represent  a  comprehensive 
attack  on  this  problem,  leading  to  the  creation  of  an  exceptionally 
large  data  base,  together  with  a  wide  range  of  techniques  for 
consonant  recognition  and  a  critical  evaluation  of  their  capabilities. 

2.  Reconstruction  of  Multidimensional  Signals  from  Projections 

In  many  applications,  a  set  of  projections  of  an  N-dimensional 
object  onto  (N-l)  dimensions  are  available  from  which  it  is  possible 
to  reconstruct  the  original  object,  x-ray  photographs,  as  obtained 
in  medical  applications  and  the  inspection  of  mechanical  systems, 
represent  two-dimensional  projections  of  the  three-dimensional 
objects  which  have  been  X-rayed.  A  number  of  new  results  for  this 
problem  have  been  obtained  by  formulating  the  reconstruction  problem 
directly  in  digital  signal  processing  terms.  Based  on  this  formula¬ 
tion,  several  algorithms  have  been  developed  which  appear,  based  on 
several  reconstruction  examples,  to  be  superior  to  previous  algorithms 
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in  some  cases.  Most  of  the  reconstructions  performed  have  been 
the  transformation  of  one-dimensional  projections  to  two-dimensional 
photographs.  A  reconstruction  of  a  section  of  leg  bone  from  real 
x-ray  data  has  also  been  performed.  This  work  has  resulted  in  the 
completion  of  an  Sc.D.  thesis  (R.M.  Merserau;  "Digital  Deconstruc¬ 
tion  of  Multidimensional  Signals  from  their  Projections") ,  the 
abstract  of  which  is  attached.  Some  of  the  algorithms  are  also 
summarized  in  the  paper  "The  Digital  Reconstruction  of  Multidimen¬ 
sional  Signals  from  their  Projections"  by  R.M.  Merserau;  Proc.  10th 
Annual  Allerton  Conference  on  Circuit  and  System  Theory,  Oct.  4-6, 
1972,  Monticello,  Ill.,  pp.  326-334. 

Some  preliminary  investigations  into  the  use  of  projections 
for  picture  bandwidth  compression  have  also  been  completed,  which 
have  led  to  promising  results.  If  a  function  can  be  represented  by 
its  projections,  then  perhaps  pictures  can  be  transmitted  or  stored 
by  utilizing  a  set  of  projections  for  these  pictures,  thus  using 
fewer  bits  than  are  required  for  transmitting  the  entire,  picture 
directly.  These  experiments  are  summarized  in  the  doctoral  thesis  by 
Merserau. 

3.  Development  of  a  Digital  Processor  for  Speech  Synthesis 

Detailed  design  of  the  "Black  Box"  processor  has  been  completed. 
There  have  been  two  major  changes  in  design,  and  several  smaller  ones 
as  the  specific  logic  has  been  developed.  First,  a  major  decision 
has  been  made  to  build  the  processor  from  ECL  10k  logic,  rather 
than  74  series  and  Schottky  TTL.  This  change  will  increase  the 
speed  of  the  processor,  but  it  should  also  result  in  greater  system 
reliability.  It  is  our  belief  that  with  proper  design  precautions 
as  currently  understood,  ECL  systems  should  be  more  reliable  and 
immune  from  noise  problems  than  TTL  systems.  The  second  major  change 
has  been  an  increase  in  data  word  length  from  18  to  24  bits  to  per¬ 
mit  retention  of  more  significant  figures.  The  new  multiplier  will 
be  16  X  24,  and  should  run  in  less  than  100  nsec.  The  memory  uses 
1024X1  ECL  chips  which  are  just  now  becoming  available,  and  which 


provide  adequate  storage  at  state-of-the-art  speeds. 

Large  circuit  boards  have  been  ordered,  and  ECL  parts  will  be 
ordered  shortly.  The  only  remaining  design  problems  concern  the 
exact  nature  of  the  PDP-9  interface,  which  must  now  be  modified  to 
allow  for  24-bit  data  words. 

An  internal  document,  "The  Black  Bex",  is  attached  which 
describes  the  design  of  the  processor  in  detail. 

4.  Design  of  Two-Dimensional  Recursive  Digital  Filters 

Recent  work  has  been  based  on  using  the  one-projection  results 
of  Mersorau  to  allow  the  inference  of  two-dimensional  structures 
from  their  one-dimensional  projections.  In  this  way,  one-dimensional 
approximation  theorems  can  be  used  to  simplify  the  two-dimensional 
recursive  approximation  problem.  The  theory  for  the  design  of  two- 
dimensional  recursive  filters  whose  magnitude-squared  frequency 
response  approximates  a  desired  function  is  complete,  and  the 
algorithm  has  been  programmed  and  is  currently  being  debugged. 

There  are,  however,  theoretical  problems  concerning  the  realizability 
of  the  designed  filters.  These  problems,  which  must  be  circumvented 
before  the  design  algorithm  can  be  considered  complete,  are  the  focus 
of  current  effort. 

Additionally,  a  simple  inverse  filtering  program  was  written 
using  the  FDP-Univac  picture  processing  system  developed  last  year. 
Digital  photographs  have  been  blurred  by  a  recursive  lowpass  filter 
(taking  approximately  10  seconds) ,  and  then  reconstructed  exactly 
using  a  non-recursive  high-pass  filter  (taking  approximately  5 
seconds) . 
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ABSTRACT 

Several  algorithms  for  approximating  a  multidimensional  density 
function  in  terms  of  its  projections  ("x-ray  photographs")  at 
different  orientations  are  presented.  This  is  accomplished  by 
means  of  a  theorem  which  states  that  the  Fourier  transform  of  a 
projection  of  a  function  is  a  slice  (central  section)  of  the 
Fourier  transform  of  that  function.  Special  emphasis  is  given 
to  bandlimited  functions  as  these  are  readily  digitized.  Some 
theorems  are  presented  which  state  that  bandlimited  functions 
of  a  certain  class  can  be  represented  by  a  single  projection. 

INTRODUCTION 

There  are  occasions  when  the  structure  of  a  three-dimensional  object  is 
unknown  and  desired  but  only  two-dimensional  projections  of  that  object  are 
available.  A  transmission  x-ray  photograph  might  represent  such  a  projection. 
A  single  x-ray  photograph,  since  it  is  merely  a  two-dimensional  representa¬ 
tion  of  an  Inherently  three-dimensional  structure,  does  not  contain  all  of 
the  information  that  a  physician  might  want,  since  all  detail  in  a  direction 
normal  to  the  photographic  plate  has  Keen  superimposed.  .One  solution  to 
this  dilemma  is  to  take  fc-ray  photographs  from  different  vantage  points  and 
use  this  set  of  x-rays  to  recreate  a  three-dimensional  image,  say  in  a 
digital  computer.  In  this  paper  we  will  consider  several  techniques  for  ac¬ 
complishing  this.  In  addition  to  performing  reconstruction  from  x-rays  [4] 

[5] [7] [8],  these  algorithms  are  useful  for  reconstructing  from  electron 
micrographs  [2] [3],  fan-beam  radio  telescope  scans  [1]  and  line  responses 
from  linear  shift-invariant  optical  systems. 

Because  these  algorithms  will  be  Implemented  on  a  digital  computer,  we 
are  constrained  to  work  with  sampled  data.  Our  input  projections  must  be 
sampled  and  our  output  reconstruction  will  consist  of  samples  of  the  unknown 
structure.  As  a  result  this  problem  'is  best  considered  as  an  inherently 
digital  one.  This  constrains  us  somewhat  by  limiting  the  class  of  signals 
that  can  be  reconstructed,  but  this  is  not  a  serious  concern  since  most  sig¬ 
nals  that  arise  in  practice  can  be  closely  approximated  by  signals  from  this 
restricted  class.  On  the  other  hand,  by  constraining  ourselves  to  this  one 
class  of  signals,  we  have  developed  some  extremely  powerful  algorithms  and 
interesting  results. 


THE  GENERAL  RECONSTRUCTION  ALGORITHM 

Up  to  this  point  we  have  assumed  that  projection  functions  are  simply  the 
mathematical  equivalents  of  x-ray  photographs  or  electron  micrographs.  To 
understand  the  problem  more  fully  we  must  make  that  definition  more  precise. 
Let  us  assume  that  one  unknown  signal  can  be  described  by  an  extinction 
function  f(x,y,z)  and  that  an  x-ray  photograph  is  made  of  the  unknown  by  a 
uniform,  collimated  x-ray  beam  with  intensity  IQ  which  propagates  parallel 
to  the  y-axis.  Then,  ignoring  scattering  effects,  the  observed  intensity 
variation  of  the  x-ray  photograph  can  be  described  by: 

I(x,z)  -  I0  exp{  -/  f (x,y,z)  dy)  (1) 

We  can  then  define  the  projection  function  associated  with  this  orientation 

by 


i 


i(x,*)  -  •  /  f(x,y,z)  dy 


(2) 


Th«  function  of  f(x,y,z)  might  measure  the  density  of  the  unknown  as  it  does 
to  some  extent  in  the  case  of  x-rays  or  it  might  measure  the  extent  of 
etaining  of  a  specimen  in  the  case  of  electron  micrographs,  etc. 

„ ,  Tu  gener*lize  e<I*  (2),  let  us  assume  that  each  projection  is  taken  normal 
to  the  z-axls  and  that  the  projection  plane,  the  u-z  plane,  meets  the  x-z 
plane  at  an  angle  0.  With  this  notation,  eq.  (2)  corresponds  to  the  projec¬ 
tion  at  angle  0-0  .  We  shall  define  the  projection  at  angle  0  by 

m 

p8(u,z)  -  /  f (ucosO  -  vslnO,  usinO  +  vcosO,  z)dv  (3) 

mrnm 

F(“x'5r»wz>  18  the  Courier  transform  of  f(x,y,z),  and  if  we  define  a 
plane,  the  m-mz  plane  to  intersect  the  mx-m2  plane  at  an  angle  ♦,  then  we 
can  further  define  the  slice  of  F(u,x,Wy,«z)  at  angle  ♦  to  be  F(u,cos*,«osin*, 

f  8llce  o£  F<“x»«y.«z>  corresponds  to  the  function  evaluated  on 
a  pla.ie,  which  Includes  the  ur  axis,  and  which  is  specified  by  a  single 
angular  parameter.  These  relationships  are  illustrated  in  Fig.  1. 

|heorem:  (projection/slice  theorem)  The  projection  of  f(x,y,z)  at  angle 

0  to  the  x-z  plane  is  the  Fourier  ttansform  of  the  slice  of  F(co,  w.  u  ) 
at  angle  0  to  the  mx-mz  plane.  ^«x»«y."*' 

Prjof:  Let  us  define  a  coordinate  system,  the  w.i.w  coordinate  system, 
which  is  a  rotation  by  0  (radians)  about  the  ut-axisf  of  the  mv,mv.m. 
Coordinate  system.  x  y*  * 


uxcos0  +  ursine 


«  “  WjfSinO  +  u  cos0 
Then  since  y 

m  m  m 

F(“x»“y*“z>  ■  /  /  /  f (x,y,z)  exp(-j[xMx+yUy+zMz])  dxdydz 

we  can  write 

.  —  —  — 

F(u,«,«x)  -  F(ux,Wy(uz)  -  /  /  /  f (x,y,z)  exp(-j [xwcosO  +  ymsinfl 


(4) 

(5) 


-  xuslnO  +  yucosO  +  zm  )]  dxdydz  .  (6) 


Since 


F(o)cos0, waine, u>2)  -  F(m,0,m  )  -  /  /  /  f(x,y,z)  exp(-J[xucos0 


+'  yHSlnO  +  zw  ])  dxdydz  (7) 
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(8) 


by  defining 

u  -  xcos0  +  yslnO 
v  ■  -xsln0  +  ycosO 
we  can  observe 

F(mcos0, usinO, m^)  ■  F(m,0,mz)  ■  /  /  f/  f (ucose-vslnO,  usine+vcose. 


* 

r)dv|exp{-J(uurfZM2)}  dudz  (9) 

mm  * 

f  /  Pe(u,z)  exp{-j(uw+ZM_)}  dudz  (10) 


Q.E.D 
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,  Projection/slice  theorem.  Prom  these  we  can* 

iV^rtt  21T|  Urier  8P*Ce  “d  then  by  lnverse  Fourier  transforming 

1  «cinst^?ii°n8H  W*  ^  “  e8timate  of  f(x,y,z),  or,  in  our  terminology, 

hed  L  IZ  ^  “Ulny  Pr°Jection8  8re  «®®ded,  how  tuey  should  be  sam- 

pled  and  how  the  Fourier  space  should  be  estimated  from  samples  of  the 
slices  depend  upon  the  individual  algorithms. 

SPECIFIC  RECONSTRUCTION  ALGORITHMS 

IhlieL^«Hr'wJ??t  the  functl?n*  t0  be  reconstructed  ar.  bandllmlted. 

«d  lJ  Im  thL  Ini”";"'  thM  *U  °f  the  •"  bandllmlced 

f  U?^  to  COmPute  Fourier  transforms  from  samples  of  the 

l0!f  of.lnfona8tion.  8  necessity  for  performing  a  recon- 
*^  i°n  digitexly.  Also  for  our  arguments  let  us  assume  that  we  are 

siiiOl  iroWMiirCtTO/Tdi”?8i?nal  function  (picture)  from  one-dimen- 
lalHi  of  z^  equivalent  to  reconstructing  for  only  a  single 

kn^n^rrf0?  *!  1186  °bt8Jn  88mple8  °f  the  Fourler  transform  of  an  un- 
own  picture  is  to  sample  earh  projection  at  the  same  sampling  rate  that 

If  fhen  eichre!ter  th*V“  “8xlmu»  Nyquist  rate  of  each  of  thf  projections, 
rirhi  ^  8e?uence  of  samples  is  Fourier  transformed  using  a  DFT  algo- 
a  Fourl®r  transform  of  the  unknown  picture  f(x,y)  will  be  known  on 

*  PthP  taster  of  points.  If  the  projections  were  evenly  spaced  from  0  to 

F(«  !  )*Jh«e°Lr  t*  tHr  8h0Wn  in  Fig*  2(8>*  Froa  the8e  samples  of 
nn»x«fyL  h  *  number  of  techniques  that  can  be  used  to  estimate  f(x,y). 

*  *11*  8Ufce88ful  5hat  have  used  is  to  use  linear  interpoLt^n 

nol  6  “  p  He  p8lues,  ofrF(“x»“y)  on  a  Cartesian  (square)  raster  fJom  the 
polar  samples.  From  the  Cartesian  raster  we  can  perform  an  inverse  two- 

dimensional  DFT  to  obtain  our  estimate.  Another  approach  that  has  yielded 

in teeral°in tru®tlons  f*°m  8  Pol»*  «ster  is  to  write  the  inverse  transform 
integral  in  polar  coordinates, 

!  -  *  « 

F(x»y)  “  1  t  F(«,8)  exp{ j (xucosB  +  yusin6))u  dude  (11) 

0  ”T 

•nd  then  approximate  (11)  by  .  sum,  where  each  summand  depend,  upon  on.  of 
th.  polar  samples .  The  latter  technique  performs  well  resolving  detail  but 
p“’ cly  “  *re"  °f  l0“  Information  content,  such  ..  background,  or 
*”*!  Ill'll  c°n,t,nJ  8rey  level.  The  Interpolation  technique,  on  the 

not  extract  d^ll^U  op'”’*lt‘-  U  ««  area,  but  doe. 

As  an  alternative  to  sampling  each  projection  at  the  same  sampling  rate 
we  can  vary  the  sampling  rate  with  the  projection  angle.  In  parSicula^ 

IuJrth2‘^  <X’y^  f8b“dll“it?d  wlthlna  8{>U8W  in  the  frequency  domain 

prole«l“n  1.  S.rfl:?l>  "  ^  th‘  b“d“ld'h  •*  «- 


ix{ | cos6  | ,  I  Sine  I ) 


-C 

i  C  «•  ' 


t  ,  »  •« 


How  suppose  we  sample  each  projection  at  a  sampling  rate  which  is  propor¬ 
tional  to  its  bandwidth.  If  these  sequences  are  then  DFT'd,  F(w_,wy)  will 
be  known  on  a  concentric  square  raster  such  as  that  in  Fig.  2(b),  At  first 
glance  such  a  set  of  points  is  not  to  be  favored  over  the  polar  set,  but 
such  is  not  the  case.  Using  both  the  interpolation  algorithm  and  the 
integral  approximation  algorithm,  we  get  better  reconatructlons  from  the 
concentric  squares  raster  than  from  the  polar  one.  One  reason  for  this  can 
k®  seen  from  the  fact  that  the  Cartesian  raster  of  samples  which  we  must 
have  in  order  to  perform  an  Inverse  DFT  lie  along  the  same  horizontal  and 
vertical  lines  as  the  sides  of  the  concentric  squares.  Thus  instead  of 
the  necessity  of  performing  a  two-dimensional  linear  Interpolation,  we  only 
n*®d  to  perform  a  one-dimensional  one.  Thia  might  be  expected  to  produce 
less  error. 

The  second  advantage  to  a  concentric  squares  raster  becomes  apparent 
when  we  consider  a  special  class  of  signals.  We  have  assumed  that  f(x,y) 
is  bandlimited.  Let  us  now  assume  that  it  can  be  represented  in  the  form 


N-l  N-l 

-  l  l  f(?7> 

m-C  n-0  w  »w 


ainw(x-  slnw(y-  ~) 

«*<*-  ->  (y-  f> 


(13) 


Thus  in  addition  to  requiring  that  f(x,y)  be  bandlimited,  we  have  required 
that  when  sampled  at  its  Nyquist  rate  it  have  only  a  finite  number  of  non¬ 
zero  samples.  The  number  N  which  represents  the  width  of  the  square  of 
non-zero  samples  we  will  refer  to  as  the  order  of  f(x,y).  Bandlimited 
functions  of  finite  order  are  completely  specified  by  their  DFT's  and  their 
Fourier  transforms  are  two-dimensional  polynomials  of  degree  N-l  in  each 
variable.  Because  of  the  fact  that  a  one-dimensional  polynomial  of  degree 
d  is  completely  specified  jy  d+1  independent  samples,  it  can  be  proven  that 
a  bandlimited  function  of  order  N  can  be  reconstructed  exactly  from  N+l 
projections  (although  not  necessarily  for  all  seta  of  N+l  projections). 

Thus  for  this  class  of  functions,  concentric  square  projections  as  these 
projections  shall  be  called,  assume  theoretical  importance. 

In  Fig.  3  is  shown  a  reconstruction  of  a  cross  section  of  a  leg  bone 
from  36  concentric  squares  projections.  The  projections  were  sections  of 
x-rays  which  were  taken  at  5°  intervals  which  are  scanned  logarithmically. 
Each  section  of  each  x-ray  consisted  of  256  samples  and  the  reconstruction 
is  plotted  on  a  256x256  Cartesian  raster. 


RECONSTRUCTING  FROM  ONE  PROJECTION 

In  the  previous  section  we  noticed  that  bandlimited  functions  of  finite 
order  could  be  reconstructed  exactly  from  N+l  (concentric  squares)  projec¬ 
tions,  where  N  was  the  order  of  the  function.  *  Is  this  the  minimum  number 
of  projections  needed  to  reconstruct  these  functions?  The  answer  to  this 
question  is  no.  In  fact,  functions  of  this  form  can  be  reconstructed  from 
a  single  projection. 

Theorem:  (one  projection  theorem)  A  bandlimited  function  f(*.,y)  of 
order  N  in  each  variable  can  be  reconstructed  from  the  concentric 
squares  projection  at  6  -  tan-1(l/N). 

Proof:  Consider  the  slice  at  6  -  tan”^(l/N), 

F(ucos8,wsin6)  -  F  (  - — —  ,  - - -  )  Q4) 

+  1  /N2  +  1 


where 


F(«  ,u  ) 
x  y 

N-l  N-l 

-  1  l  ftSL'SL)  exp(-j  [ 

m-0  n-0  W  W 

^(mwx+nuy) ) ) 

where 

6.-.(w  »«  )  m  1 

'  i .  K 1 1  w,  i 

“y  1  I  W 

WW  x*  y' 

(o  ,  otherwise 

b  (u>  ,u)  ) 

ww  x*  y 


Evaluating  (14), 


n  — aH- ,  — — ) 

/P+i 


Y  Y  *  Ir*  exp(-j  <— ==  iN«+nj)} 

m-0  n-0  w  w  W»/P+I 


(15) 


(16) 


|<d  <  ^  *fri+l 

■*  N 


If  we  define  g(Nm+n)  -  f(— ,  — ) ,  then  (16)  becomes 

W  W 


F( 


uN 


u 


H-l 


/N^+l  ^jJ+I 


)  - 


l  g(p)  exp  (- j  )  ,  |u|  <_J  /H2+1 

p-0  W*^+T  " 


(17) 


otherwise 


Thus  over  the  region  of  interest,  the  slice  at  0  -  tan~^(l/N)  (which  can 
be  obtained  from  one  projection)  is  a  one-dimensional  polynomial  oi  degree 
N2-l  in  the  variable  exp{-J  (iru/W'H2+l) ) ,  and  the  coefficients  of  that 
polynomial  are  simply  the  picture  samples  arranged  column  by  column.  Thus 
knowledge  of  N2  sample  values  specifies  the  picture  samples  and  by  (13) 
this  specifies  the  unknown  picture  function. 

Q.E.D. 


This  theorem  also  has  implications  in  two-dimensional  filter  derxgn  for 
it  implies  that  the  frequency  response  of  a  two-dimensional  non-recurslve 
digital  filter  is  specified  by  its  values  along  a  single  radial  line, 
although  how  this  fact  might  be  utilized  in  filter  design  is  still  not 
clearly  understood. 

Despite  the  beauty  of  the  one-projection  theorem,  it  is  not  particularly 
useful  as  a  reconstruction  technique  on  a  finite  precision  machine,  because 
for  values  of  N  larger  than  4  or  5,  solving  eq.  (17)  for  (f(mir/W,  nir/W)} 
requires  the  inversion  of  a  large  nearly  singular  matrix  and  any  errors 
made  in  obtaining  the  slice  samples,  which  are  bound  to  occur,  are  amplified 
enormously.  In  this  respect,  applying  the  one  projection  theorem  is  much 
like  trying  to  apply  analytic  continuation  to  an  unknown  function  all  of 
whose  derivations  are  known  at  a  single  point. 

From  a  theoretical  viewpoint,  it  is  interesting  to  ask  if  there  are  any 
other  of  these  critical  slices.  Clearly  not  all  slices  will  work;  for  ex¬ 
ample,  knowledge  of  sample  values  along  just  the  ux  or  uy  axis  is  generally 
not  sufficient.  In  fact,  there  are  an  infinite  number  of  these  critical 
slices.  A  necessary  and  sufficient  condition  for  a  slice  with  a  rational 
slope  to  be  a  critical  slice  is  given  in  the  following  theorem  which  will 
be  stated  without  proof. 


Theorem;  Tf  a  slice  has  an  angle  tan"l(A/B)  where  A  and  B  are 
integers  with  no  common  factor,  then  a  necessary  and  sufficient 
condition  for  this  slice  to  be  sufficient  for  reconstruction  of 
a  bandllmited  function  of  order  N  is  that  the  equation 

0  <  m,n,m'  ,n'  <_  N-l 


Bm  +  An  •  Bm'  +  An* 


possess  only  the  trivisl  solution  m  -  m' 

n  ■  n' . 

In  particular,  if  N  is  a  power  of  two  and  A  is  an  odd  integer,  the 
slice  with  slope  tan-1(A/N)  is  critical. 

Using  techniques  completely  analogous  to  that  by  which  we  derived  the 
one  projection  theorem,  we  can  derive  two  projection  theorems,  four  pro- 
jectlon  theorems,  etc.  As  a  rule  of  thumb,  the  greater  the  number  of  pro¬ 
jections  which  we  care  to  use,  the  easier  it  is  to  obtain  a  solution  with 
real  data.  It  generally  takes  about  N/2  projections  for  the  sensitivity  to 
be  reduced  enough  to  use  currently  available  machines. 

SUMMARY 

A  number  of  algorithms  have  been  presented  for  reconstructing  multi¬ 
dimensional  signals  from  their  projections,  such  as  x-rays.  In  addition  to 
their  uses  in  performing  reconstructions,  projection  functions  are  poten- 
ti4lly  useful  for  characterizing  multidimensional  signals  for  purposes  of 
pittem  recognition  or  bandwidth  compression  for  signal  transmission.  Uses 
‘for  projections  in  these  areas  has  not  been  explored. 
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THE  BLACK  BOX 


being  a 

small,  fast,  inexpensive 
digital  processor 
desigrted  mainly  for 
speech  synthesis 
but  nicely  suited 
for 

myriad  other  tasks 

J.  Allen 
F.x.  Carroll 
E.  Jensen 


This  document  contains  a  detailed  description 
of  the  proposed  Black  Box.  Readers  are  asked 
to  comment  freely  and  quickly  so  we  can  proceed 
with  construction. 


The  Black  Box 


1.  Introduction.  In  this  paper,  we  will  describe  a  new  computer 
having  several  unusual  design  features.  The  original  motivation 
for  this  design  was  the  need  for  a  real-time  all-digital  speech 
synthesizer.  Since  the  vocal  tract  model  to  be  simulated  is 
complicated  (see  Fig.  1,  due  to  D.H.  Klatt) ,  it  was  necessary 
to  adopt  several  features  which  optimize  these  calculations  in 
time.  The  heart  of  the  computer  is  a  very  fast  multiplier 
(18  X  18  in  about  100  nsec).  No  instruction  overlap  is  used, 
but  instructions  have  a  three-address  format,  so  that  (for  example) 

A-B  •*  C  is  done  in  one  instruction,  including  the  two  loads  and 
one  deposit.  Each  instruction  contains  an  op-code  plus  these 
three  addresses,  as  well  as  a  jump  address  for  the  next  instruc¬ 
tion,  so  that  no  program  counter  is  needed.  There  is  a  separate  instruction 
memory  of  44-bit  width,  and  two  data  memories,  each  of  18-bit 
width.  It  should  be  noted  that  no  special  registers  are  provided. 

There  is  no  accumulator,  no  program  counter,  and  the  machine 
status  word,  as  well  as  the  direct  memory  access  word  count  and 
addresses  are  kept  in  memory,  so  that  they  can  always  be  inspected 
by  the  program.  Very  little  timing  is  needed  internal  to  an 
instruction,  since  the  multiplier  is  combinatorial,  ind  shifting 
is  accomplished  via  multiplexing.  This  results  in  a  fairly  simple 
control  structure,  most  of  the  complication  arising  from  I/O 
considerations.  The  computer  is  designed  to  be  interfaced  to  a 
host  PDP-9  machine,  and  will  probably  not  be  used  in  a  stand-alone 
mode,  although  this  is  possible.  Direct  memory  access  transfers 
are  possible  in  either  direction  between  the  PDP-9  and  any  memory 
location  in  the  black  box  at  a  1  megacycle  rate.  In  addition, 
transfers  to  and  from  the  PDP-9  accumulator  and  any  black  box 
location  are  possible.  PDP-9  IOT  instructions  can  be  used  to 
control  various  control  bits  in  the  black  box.  Finally,  part  of 
the  data  memorv  is  paralleled,  so  that  it  is  possible  to  compute 
using  one  parameter  set,  while  a  new  set  is  being  loaded  (trans¬ 
parently  to  the  blacx  box  program)  from  the  PDP-9.  In  this  paper, 
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we  present  several  examples  to  show  the  utility  of  this  device 
for  other  specialized  tasks,  including  display  processing, 
floating  point  calculations,  and  FFT's.  The  basic  design  goal 
has  been  to  achieve  a  powerful,  fast,  yet  inexpensive  processor, 
but  with  little  consideration  given  to  ease  of  programming. 

2.  Architecture.  Figure  2  shows  a  rough  block  diagram  of  the 
computer.  Three  components  can  be  recognized. 

Instruction  processor:  Includes  a  256  X  44-bit 
instruction  memoty,  loadable  directly  from  the  host  machine; 
an  instruction  register;  instruction  addressing;  skip  logic; 
and  instruction  decoding. 

2)  Data  processor:  Includes 

a)  Two  data  memories,  X  +  Y,  each  256  X  18,  and 
each  having  some  of  its  locations  paralled  by 
additional  memory  locations. 

b)  Three  processing  units: 

!•  Multiply  unit,  performs  X.Y+Z 

2.  Arithmetic-Logical  Unit  (ALU),  performs 
adds,  subtracts,  and  logical  operations. 

3.  DIT  test,  performs  skip  on  a  selected 
bit  of  a  given  word  and  provides  for 
modification  of  that  bit. 

c)  Data  Select:  Selects  desired  output  from  pro¬ 
cessing  unit,  with  capability  to  incorporate  shifts. 

3)  Input-Output  Processor;  provides  programmed  and  DMA 
transfers  with  the  host  computer. 

--3-U'rG  3.  shows  a  more  detailed  block  diagram,  which  is 
intended  to  relate  to  the  further  detailed  discussion. 

3.  Instruction  processor:  Instructions  are  44  bits  long,  and 
cannot  be  modified  within  the  black  box  except  that  the  effective 
address  of  all  address  fiolds  can  be  modified  by  index  register?, 
or  an  instruction  may  be  skipped.  The  host  computer,  hewover, 
has  access  to  the  256-location  instruction  memory  so  that 
instructions  may  be  updated  at  any  time 
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by  input  transfers  into  the  black  box.  There  is  no  program  counter , 
so  that  each  instruction  contains  a  jump  address  to  the  next  instruc¬ 
tion.  This  jump  address  can  be  modified  by  indexing  or  by  ORinq  into 
its  LSB  the  OR  of  the  following  3  conditions: 

1)  Data  Select  output  ■  zero 

2)  "  "  "  negative 

3)  one  bit  selected  by  BIT  instruction. 


There  are  five  classes  of  instructions, 
block  of  features:  ^ - 

4-  .  2-  .  B  1-8  8- 


All  five  have  a  common 


i 

!! 

B 

mm 

EM 

V 

err 

D 

T 

The  only  variations  in  this  format  over  the  entire  group  of 
instructions  are  the  length  of  the  X-address  field  (8  or  9  bits) 
and  the  specifics  of  the  "Y  or  special"  field,  which  is  either  the 
Y-address  (in  double-source  instructions)  or  the  specifier  of 
actions  peculiar  to  certain  single-source  instructions.  We  first 
describe  the  common  features  (i.e.  everything  but  the  "Y  or  special" 
field)  : 


l) 

Index  control: 

There  are 

4  index  registers  located 

the 

X  memory 

X-Address 

Register 

Action 

110 

x„ 

8 

Y 

modifies  Y-address 

128 

xx 

"  X-  " 

l38 

XD 

"  D-  " 

148 

XJ 

"  Jump  " 

The  4  index  bits  control  the  ORing  of  these  registers  into 
their  respective  address  registers.  The  use  of  ORing,  as 
opposed  to  adding,  simplifies  the  logic  and  increases  the 
speed  of  indexing  at  very  little  cost  in  programming  conveni¬ 
ence.  In  single-argument  and  BIT  instructions,  the  Y  index 
bit  is  ignored. 


i 


-4- 


2)  Skip  bits:  There  are  always  two  possible  skips  associated 
with  each  instruction. 

N  skip  on  data  select  output  negative 
Z  skip  "  "  "  "  ■  zero 

Placing  l's  in  these  fields  enables  the  skips ,  which  cause 
a  1  to  be  ORed  into  the  jump  address  LSB  if  the  designated 
skip  condition  is  met. 

3)  Op  Code:  The  op  code  is  5  bits  long.  Detailed  explan¬ 
ations  are  given  below. 

4)  X-address.  In  double-source  instructions,  this  is  an 
8-bit  address  in  the  X-memory.  In  single-source  instruc¬ 
tions,  the  9  bits  designate  a  location  in  X  or  Y  memory. 

If  the  MSB  is  0,  the  X-memory  is  addressed,  and  if  it  is  a 
1,  the  Y-memory  is  addressed. 

5)  D-address:  This  is  the  9  bit  destination  address, 
i.e.  where  the  result  of  an  instruction  is  stored.  When 
the  memory  is  in  the  serial  mode,  the  result  can  be  stored 
in  any  X-  or  Y-location,  but  when  the  memory  is  in  parallel 
mode,  the  result  is  stored  in  the  analagcus  locations, 
specified  by  the  8  least  significant  bits,  in  both  memories. 

4 

This  last  situation  is  violated  in  input  transfers,  discussed 
below. 


6*  Jump  Address:  8-bit  address  of  next  instruction. 


This  block  of  features  is  augmented  in  different  ways  to  form 
the  five  instruction  classes: 

I)  Double-Source  Instructions :  The  X  and  Y  fields  are  both 
8  bits  long,  and  each  specify  a  source  loco4 '•  on  in  their 


respective  memories.  See  Figure  4.  The  result  is 
in  D,  and  the  next  instruction  token  from  J. 


stored 


II)  Muljtijvly:  X  and  Y  are  8-bit  source  addresses,  and  a 
third  source  is  always  implied,  namely  the  Z-rccjister,  which 


is  location  10g  in  the  X-memory.  See  Figure  5.  Note  that 
Z  is  always  cleared  before  a  deposit  is  made,  so  that  it 
will  either  be  properly  updated  when  D  ■  Z,  or  set  to  zero 
to  proviue  only  the  X*Y  function.  The  op-code  contains 
a  3-bit  scaling  field,  and  the  8  possible  options  are  shown 
in  Figure  5. 

III)  Index  Register  Modification  Instructions:  These 
look  like  an  ADD  instruction  in  that  the  contents  of  the 
two  source  addresses  are  added  and  stored  in  the  destina¬ 
tion  address,  except  that  the  result  is  also  stored  in  the 
index  register  referred  to  by  the  instruction.  See  Figure  6. 
There  are  4  index  instructions,  one  for  each  index  register. 

IV)  BIT  instruction;  The  X-field  is  9  bits  long  so  that 
any  X  or  Y  location  can  be  used  as  source.  The  remaining 
(special)  field  of  7  bits  is  divided  into  a  5-bit  select 
field  and  a  2-bit  modification  field.  See  Figure  7.  The 
select  field  specifies  which  of  the  18  bits  of  a  data  word 
is  to  be  examined.  If  this  bit  is  a  1,  it  is  ORed  into  the 
LSB  of  the  jump  address.  The  modification  field  can  then 
affect  this  bit  in  any  of  4  ways,  shown  in  the  figure. 
Additionally,  after  this  modification  is  made,  the  usual 
skip  tests  can  be  performed  on  the  resultant  data  word. 

V)  Single-Source  Instructions:  Here,  the  X-address  field 
is  9  bits  long,  and  the  remaining  7  bits  is  used  to  specify 
shifts  and  rotates  as  shown  in  Figure  8.  Thus  both  source 
and  destination  can  be  any  address  in  X  or  Y. 

The  instruction  processor  has  the  following  paths  to  the 
rest  of  the  machine: 

a)  Index  bits:  cause  index  registers  to  be  ORed 

into  the  corresponding  address  registers 


b)  Skip  bits:  cause  outputs  from  S/:ip  tests  to  be 

ORed  into  the  jump  address 

c)  Op  code  including  "special"  field  when  present: 

goes  to  instruction  decoder,  then  to 
instruction  execution  control  in  the 
data  processor. 

d)  Source  and  destination  addresses:  routed  to 

respective  data  memory  address  registers 

e)  Jump  address:  routed  to  instruction  address 

register. 


Addressing  Figure  9  shows  the  complete  address  soace,  both  as 
internally  and  by  the  host  computer.  Note  that 

1)  To  the  host  machine 

a)  the  X  +  Y  memories  are  one  512-register  memory, 
locations  0-777 

b)  each  instruction  memory  location  is  mapped  onto 
4  host  computer  18-bit  words,  so  that  4X256  *  1024 
host  memory  words  are  needed  to  represent  the  instruc¬ 
tion  memory.  These  are  locations  2000-3777  of  the 
overall  address  space. 

2)  The  top  40g  locations  of  the  Y  memory  can  be  switched 
between  two  separate  physical  memories.  More  on  this  below. 

3)  The  bottom  15g  locations  of  the  X  memory  are  special 
registers,  which  can  be  accessed  directly  in  addition  to 
the  normal  X-addrossing. 


5.  Data  Processor:  As  mentioned  above ,  there  are  3  units  in  the 
data  processor: 

a)  X  +  Y  data  memories :  These  are  256  X  18-bit  memories, 
which  can  run  in  one  of  2  modes  selected  by  a  control  bit 
in  Word  O  of  the  X-memory. 

1)  Serial :  This  mode  logically  places  the  X  and  Y 
memories  end-to-end.  X  is  the  bottom  256  locations, 
and  Y  the  top  256  locations.  Thus,  in  this  mode,  all 
deposits  and  single-source-instruction  loads  have  access 
to  all  512  locations.  Additionally,  the  I/O  instruc¬ 
tion  (which  is  a  single-source  instruction)  can  access 
all  512  locations  in  this  mode.  Of  course,  double¬ 
source  instructions  access  both  X  +  Y  separately  and 
simultaneously. 

2)  Parallel;  Source  loads  are  unaffected  by  this 
mode,  but  deposits  go  to  analogous  locations  in  both 
X  +  Y  memories.  In  the  following  diagram,  the  case 
of  the  shadow  memory  in  parallel  mode  is  shown.  When 
the  shadow  mode  and  parallel  mode  are  both  selected, 
input  and  output  transfers  within  th°  shadow  memory 
range  go  to/from  that  memory  which  is  selected 


as  the  I/O  shadow  memory  at  the  time  of  the  transfer, 
and  no-where  else.  It  in  as  though  the  other  part  of 
the  parallel  transfer  went  to  a  (non-existent)  X 
shadow.  Thus,  in  general,  even  in  the  parallel  mode, 
the  contents  of  X  &  Y  within  the  shadow  range  will  not 
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agree,  and  this  is  desirable.  Switching  of  the  shadow 
memories  only  effects  which  of  the  two  is  currently 
local  to  the  black  box  Y  memory,  and  which  is  used 
for  I/O.  (See  below) 

The  X  memory  has  its  bottom  15g  registers  used  for  special 
purposes,  as  shown  in  Figure  10.  These  registers  can  be  used  just 
like  any  other  memory  register,  but  in  addition,  registers  4-14 
have  special  direct  output  lines,  and  registers  0-3  have  direct 
input  and  output  lines  to  the  host  computer.  Additionally,  loca¬ 
tions  1,  2,  3,  and  7  are  counters.  All  of  these  special  registers 
are  treated  in  detail  below.  Word  0  is  the  control  register,  and 
is  also  given  special  treatment. 

The  Y  memory  has  its  top  321Q  registers  duplicated  by  a 
"shadow"  memory.  (see  Figure  3)  when  the  shadow  mode  is  off,  the 
V  memory  is  a  straightforward  256  X  18  memory.  But  when  the  shadow 
mode  is  on,  a  bit  in  the  control  register  selects  which  of  the  two 
memories  will  accommodate  local  loads  and  deposits  within  this 
address  range.  I/O  with  the  host  computer,  however,  will  utilize 
that  memory  not  so  selected,  and  this  is  done  automatically.  This 
feature  was  provided  to  allow  for  a  parameter  memory,  one  part  of 

which  could  be  used  for  local  computing,  while  the  other  is  being 
updated  by  the  host  computer. 

b)  Processing  Units;  Within  the  data  processor,  there 
are  three  processing  units,  specialized  by  function. 

1)  Multiply  As  described  in  Figure  5,  this  unit 
performs  X*Y+Z.  z  is  supplied  by  register  10  of  the 
X  memory,  and  is  always  used  in  the  calculation.  An 
18X18  array  multiplier  perform  full  2's  complement 
multiplication,  together  with  the  z  add,  in  about 

100  nsec.  All  scalings  are  done  in  the  data  select 
unit. 
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2)  Arithmetic-Logical  Unit  (ALU)  This  is  a  processor 
accepting  X  t  Y  inputs  and  performing  adds,  subtracts, 
and  logical  operations,  it  is  realized  in  74S181's 
and  is  used  for  all  instructions  except  multiply. 


3)  — i-t..  test  (BIT)  This  unit  selects  one  among  the 

18  bits  of  a  data  word,  and  connects  it  to  the  jump 
address  skip  logic.  Thus,  the  selected  bit  is  Ored 
into  the  LSB  of  the  jump  address.  The  selected  bit 
can  also  bo  modified,  as  shown  in  Figure  7,  and  this 
is  accomplished  in  the  ALU. 


c)  Data Select:  This  unit  is  a  large  multiplexor,  capable 

of  gating  any  one  of  -  18-bit  words  through  the  unit. 
These  words  include  input  transfers,  multiplier  scalings, 
and  ALU  shifts  and  rotates.  Note  that  all  shifting  is  done 

by  multiplexing,  so  that  no  shift  register  or  counter  is 
required. 


Skip  tests  for  negative  and  zero  are  made  at  the  output  of 
the  data  selector,  and  all  results  are  held  in  a  latch, 

while  the  destination  address  is  set  up,  before  they  are 
stored. 


6'  ?-nPut/Output  Processor:  This  processor  handles  all  i/o  trans¬ 
fers  with  the  host  computer.  As  shown  in  Figure  9,  the  host 
machine  has  access  to  all  memory  locations  in  the  black  box.  The 
I/O  processor  accomplishes  each  transfer  by  executing  a  one-instruc¬ 
tion  interrupt  (see  below),  which  is  transparent  to  the  currently 
running  program.  This  is  the  only  interrupt  facility  in  the  black 

box.  Transfers  may  take  place  in  either  direction,  and  are  of  two 
types. 

a>  AC_transfers.  when  the  POP-9  is  the  host,  transfers  to 
and  from  its  AC  can  bo  made  undor  host  I/O  control.  Thus 
those  transfers  can  only  bo  originated  from  the  host  machine. 


< 
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b)  DMA  transfers.  DMA  transfers  (up  to  a  5  megacycle  rate) 
can  be  initiated  by  either  machine.  The  required  special 
registers  are  in  the  X-memory. 

X-Location  Use 

1  Host  DMA  address 

2  Local  DMA  address 

3  Word  count 

Note  that  all  three  of  these  registers  are  counters. 

The  details  of  these  transfers  are  as  follows:  (IOT  refers  to  host 
I/O  control  instructions) . 

a)  AC  transfers  (LOC  refers  to  an  arbitrary  black  box  data 
location) 

1.  PDP-9  AC  — *  DB  LOC 

a)  IOT  puts  BB  address  into  BB  X 2,  via  input 
buffer  register  (IBR) 

b)  IOT  puts  data  in  IBR,  and  causes  interrupt  to 
transfer  data  to  LOC 

2.  BB  LOC  -*  PDP-9  AC 

a)  IOT  puts  BB  address  into  BB  X2,  and  then 
causes  C(X2)  to  be  loaded  into  Output  Buffer 
Register  (OBR) 

b)  IOT  calls  for  data  from  OBR  — *  AC 

b)  DMA  transfers 

1.  PDP-9  DMA  to  BB 

a)  IOT  host  starting  address  to  X^ 

b)  IOT  black  box  starting  address  to  X2 

c)  IOT  word  count  to  X3 

d)  IOT  to  initiate  transfer:  direction  of  transfer 
and  shadow  mode  control  arc  also  specified. 

2.  3B  to  PDP-9  DMA 

This  is  similar  to  the  above,  except  for  direction 
of  transfer. 

3.  DMA  transfers  can  also  be  initiated  directly  within 
the  black  box,  since  all  those  special  registers, 
and  the  DMA  control  flags  are  within  the  X-memory. 
Obviously,  these  registers  can  be  inspected  at  any 
time. 
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The  I/O  instruction,  executed  during  the  I/O  interrupt,  is  a 
somewhat  specialized  MOVE  (See  Figure  8).  Thus  it  has  9-bit 
source  and  destination  addresses. 

e)  jiftRu t  LOC  is  in  the  DMA  BB  Address  register  (X  memory 
word  2);  data  select  from  Input  Buffer  register  is  enabled, 
and  a  MOVE  takes  place  from  LOC  to  LOC.  Thus  the  initial 
contents  of  LOC  are  placed  on  the  X-bus,  but  the  outputs 
of  the  data  processors  are  not  gated  through  data  select. 

In  this  way,  the  initial  contents  of  LOC  are  lost,  and  the 
Input  transfer  only  uses  the  "store"  part  of  the  MOVE 
instruction. 

b)  output  LOC  is  in  the  DMA  BB  Address  register.  A  MOVE 
from  LOC  to  LOC  is  executed,  and  the  output  of  the  data  selector 
(the  contents  of  LOC)  is  also  placed  in  the  Output  Buffer 
Register  (OBR) . 

7*  Registers.  As  previously  mentioned,  the  bottom  150 

locations  of  the  X  data  memory  are  special,  in  that  they  are  more 
than  plain  memory  locations.  Some  have  special  input  and/or  output 
lines,  and  may  oven  be  counters.  The  several  subsets  of  these 
registers  require  special  discussion. 

a)  11“1£8.  These  are  the  4  index  registers.  The  index 
instructions  (Figure  6)  store  into  them  as  well  as  the 
designated  destination  address.  The  outputs  of  these 
registers  are  ORed  into  their  respective  address  fields  if 
gated  by  l's  in  the  respective  index  control  bits  of  the 
instruction  being  executed.  These  registers  are  not  counters. 
All  index  incrementing  must  be  performed  explicitly. 

b)  10g.  This  is  the  Z-register.  Its  output  lines  are 
permanently  connected  to  the  multiplier,  so  that  a  multiply 
instruction  always  adds  the  contents  of  Z  to  X*Y  to  form 
the  output  of  the  multiplier. 

c)  6^7.  These  arc  the  Clock  and  Count  registers. 
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respectively.  Normally  the  clock  interval  is  in  the  clock 
register.  This  count  is  automatically  jammed  into  the  count 
register  when  the  latter  goes  to  0  from  -1.  This  condition 
is  equivalent  to  the  transition  to  0  being  coupled  with  a 
carry  out  of  the  MSB.  Thus  no  jamming  takes  place  when  0  is 
deposited  into  the  count  register.  Each  time  the  desired  inter¬ 
val  is  jammed  into  the  count  register,  a  pulse  is  generated 
for  external  use.  The  count  register  is  also  constructed 
so  that  it  cannot  count  beyond  0.  If  a  periodic  output 
pulse  is  desired,  merely  set  the  desired  period  in  the  clock 
register,  and  the  count  register  will  generate  output  pulses 
at  the  period  interval  repeatedly.  An  external  clock  supplies 
the  pulse  train  for  the  count  register.  Any  existing  .count 
can  be  overridden  at  any  time  by  simple  depositing  into  the 
count  register.  This  scheme,  of  course,  also  provides  for 
changes  from  one  period  to  another,  by  merely  updating  the 
clock  register  at  the  appropriate  time,  i.e.  before  the 
current  period  expires.  Also,  it  is  clear  that  no  special 
instructions  are  needed  for  these  operations. 

d)  4-5  Those  are  the  Output  Registers.  Their  outputs 
are  brought  directly  (buffered,  of  course)  to  rear-panel 
connectors.  The  intent  is  to  connect  D/A  converts  to  these 
outputs. 

e)  1-3  These  are  the  DMA  address  and  word  registers. 

Location  1  contains  the  host  computer  current  address, 
location  2  the  local  (black  box)  current  address,  and  loca¬ 
tion  3  the  present  word  count.  This  last  location  generates 
a  pulse  on  the  transition  from  -1  to  0  for  use  in  supplying 
an  interrupt  to  the  host  computer.  All  three  registers 

are  counters. 
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f)  0  This  is  the  Control  Register.  Its  individual  bits 
are  employed  as  follows:  (no  significance  is  attached  to 
the  order.) 

1.  (1  bit)  Run/Halt.  DMA  will  still  transfer  during 
Halt. 

2.  (1)  Link.  This  flag  is  set  on  carry  out  of  the 
MSB  during  ADD  or  SUB.  There  are  corresponding 
instructions  which  provide  ADD  with  Link  and  SUB 
with  Link. 

Overflow.  This  flag  is  set  on  overflow  during 
ADD  and  SUB  (and  possibly  on  some  multiplier  outputs). 

It  can  be  reset  only  by  program  control  (using  BIT) . 

4.  (2)  Shadow .  One  bit  turns  the  shadow  mode  on,  so 

that  I/O  transfers  use  the  memory  not  selecte'd,  and 
the  other  bit  selects  which  physical  shadow  memory 
is  used  by  the  BB  for  local  computation. 

5*  ^  Series/Parallel,  in  the  serial  mode,  deposits 

are  made  to  one  of  512  locations  in  X  or  Y,  whereas 
in  parallel  mode  deposits  go  to  two  locations,  one 
in  each  of  X  and  Y.  (Note  comments  above  concern¬ 
ing  I/O  in  parallel  mode  within  the  shadow  address 
range) . 

6.  (3)  DMA  There  are  3  control  bits 

a)  Start.  Resets  when  transfer  is  done,  and 

initiates  interrupt  (when  enabled) 
to  the  host  computer  when  done. 

b)  Direction. 

Into  or  out  of  the  Black  Box, 

c)  Interrupt  Enable. 

Enable  flag  allows  Black  box  to 
interrupt  host  computer  when  DMA 
transfer  is  done. 

(1)  PA.9_9.J-8  Thia  is  an  enable  flag,  which  allows  the 

counter-register  0  condition  to  interrupt  the 
host  computer. 

J 
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8.  (l)  Program  Interrupt.  This  is  a  general  flag 
which  can  be  used  to  interrupt  the  host  computer 
for  any  reason. 

9.  (i)  Program  Interrupt  Enable.  Provides  overall 
control  of  all  interrupts  from  the  Black  Box  to 
the  host  computer. 

Items  1-9  above  account  for  12  bits,  so  there  are  still  6  bits  left 
which  can  be  used  for  flags  or  other  purposes. 

8.  Front  panel:  The  front  panel  will  have  two  sets  of  switches 
and  corresponding  lights.  One  set  is  18  bits  long  for  data  or 
instructions.  The  other  set  is  12  bits  long  and  specifics  the 
address  for  a  location.  Coupled  with  DEPOSIT  and  EXAMINE  keys,  it 
is  thus  possible  to  examine  and  change  any  location  in  any  memory 
while  the  computer  is  halted.  This  process  will,  however,  destroy 
the  previous  contents  of  register  2  of  the  X-memory  (Black  Box 
DMA  address) ,  since  this  location  is  used  to  specify  the  black  box 

address  of  all  I/O  transfers. 

In  addition  to  the  DEPOSIT  and  EXAMINE  keys,  START,  STOP, 

CONTINUE,  and  SINGLE  STEP  will  be  provided,  as  well  as  EXAMINE 
NEXT  and  DEPOSIT  NEXT.  The  address  for  the  first  instruction  to 
be  executed  following  START  is  located  in  register  2  of  the  x-memory. 

Additional  display  lights  will  include  the  Instruction 
Register  (IR) ,  Input  Buffer  Register  (IBR) ,  and  Output  Buffer 
Register  (OBR)  . 

9 .  Miscellaneous 

a)  Timing:  It  is  anticipated  that  all  instructions 
except  multiply  will  tako  200  nsec,  and  multiply  will 
require  300  nsec.  These  are  conservative  figures  and, 
as  more  packages  become  available  in  Schottky  1TL,  will 
probably  be  revised  downward. 

b)  si 7.0 :  We  expect  that  the  entire  processor  will  require 
5  i/4"  rack  height.  About  300-400  IC's  will  be  required 
(the  multiplier  alone  requires  100  74Sl81's). 
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10.  Programming  Examples:  In  this  section  we  show  several  programs , 
ranging  from  the  straightforward  to  the  spectacularly  obscure ,  which 
illustrate  design  features  of  the  machine. 

a.  Complex  multiplication: 

a.4  j  b 

-bO  *  i  +v,t) 

ST  v/ 


Initially,  let  X  and  Y  index  registers  contain  N  and  L  respectively, 
and  the  D  index  register  contain  M.  The  result  is  to  go  in  loc¬ 
ations  M  and  M+l.  Therefore,  initially,  we  have; 


lOc'W ;  iV 

xW 


C.ttvlw'T 


fAvM. 

*)«■>  5 

/  C  s 

/  Cw0  T 

*  >  ^ 

-4 

T 

< 

/  (be)  -»  v 

4,T,  * 

f\OV 

L\  J  v  *  1 

j  ^  ^  A 

n 

Mt  l 


Note:  means  enable  index  reqister  for  indicated  field.  "*1" 

1  ORed  with  the  appropriate  index  register. 
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b.  Second  Order  Difference  Equation:  The  second  order  diff¬ 
erence  equation , 

**  *  •» 

may  be  implemented  by  setting  aside  two  storage  locations:  Y1  for 
yn  1  '  and  Y2  for  yn-2  '  and  Performin9  the  following  sequence; 

i  Ga„.x 
^-1  *  3*-l. 


T  * 

which  is  coded  as, 

4 

KflVG. 

/  H 

\  w 

V*,  \o 

/  (  y.  -»  G  ■  V2. 

no  >»£ 

V7_ 

\ 

1  (z- 

-+  H\) 

Clearly,  a  coefficient 

C  other  than  unity 

for  xn 

can  be  handled 

simply  by  replacing  the 

first  MOVE  by  a  MUL 

(  MUL 

C,  X,  10  ). 

It  should  be  noted  that  at  the  end  of  the  sequence,  the  Z  register 
(the  addition  entry  port  to  the  multiplier)  has  been  cleared 
since  the  last  multiply  didn't  store  in  it. 

c.  Sum  and  Difference  of  Two  Buffers:  For  convenience,  we 
assume  the  four  buffers  start  at  locations  1,  101,  201,  and  301, 
and  that  the  length  N  of  the  buffers  is  ■  100.  Also,  A^  and 
can  be  in  "overlapping"  X  and  Y  locations,  and  the  Sum  and  Diff¬ 
erence  could  be  both  in  X  or  both  in  Y ,  or  both  in  both  X  and  Y. 
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Note.  DOWN  mutt  be  an  even  location  in  the  Instruction  Memory. 


d-  Stack  Manipulation:  In  this  example,  PUSH  JUMP  and  POP 
RETURN  are  realised.  The  subroutine  S  is  assumed  to  have  an 
associated  stack  SSTACK  and  stack  pointer  SPOINT. 
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e.  Bit-reversed  Counter:  Given  a  register,  COUNT,  containing 
a  number  to  be  incremented  in  bit  reversed  order,  and  another  reg¬ 
ister,  WHERE,  containing  the  number  of  the  bit  position  (LSB=0) 
at  which  the  count  should  start: 


The  following  program  works  because  in  single-source  instructions, 
if  the  Y  index  control  bit  is  1,  the  Y  index  register  will  be  ORed 
with  the  bits  of  the  instruction  which  are  in  the  same  position  as 
the  8-bit  Y  address  of  double-source  instructions: 
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BITC  (test  bit  and  complement  it  )  will  skip  until  the  countinq 
operation  is  complete.  Thus  LOOP  is  the  even  location  reached  on 
counting  done,  and  LOOP+1  contains  the  instruction  that  steps  the 
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test  bit  loeation  to  the  right,  and  jumps  to  the  BITC  test  of  the 
new  selected  bit. 


f.  Division:  The  following  mnemonics  are  assumed: 

MOVLO  shift  in  Ones  from  the  right,  and  shift  out  through 
the  link 

MOVLZ  shift  in  Zeroes  from  the  right,  and  shift  out  through 
the  link 

MOVL  rotate  left  through  the  link 
(  By  addition  of  one  instruction  to  the  routine  (preset  the  link) , 
the  requirements  can  be  reduced  to  only  the  availability  of  MOVL.) 

The  following  is  a  routine  for  the  positive  integer  divide  of 
the  double  precision  number  (A,B)  by  C. 
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Abstract — When  digital  signal  processing  operations  are  imple¬ 
mented  on  a  computer  or  with  special-purpose  hardware,  errors  and 
constraints  due  to  finite  word  length  are  unavoidable.  The  main  cate¬ 
gories  of  finite  register  length  effects  are  errors  due  to  A  D  conver¬ 
sion,  errors  due  to  roundoffs  in  the  arithmetic,  constraints  on  signal 
levels  imposed  by  the  need  to  prevent  overflow,  and  quantization  of 
system  coefficients.  The  effects  of  finite  register  length  on  implemen¬ 
tations  of  linear  recursive  difference  equation  digital  filters,  and  the 
fast  Fourier  transform  (FFT),  are  discussed  in  some  detail.  For  these 
algorithms,  the  differing  quantization  effects  of  fixed  point,  floating 
point,  and  block  floating  point  arithmetic  are  examined  and  com¬ 
pared. 

The  paper  is  intended  primarily  as  a  tutorial  review  of  a  subject 
which  has  received  considerable  attention  over  the  past  few  years. 
The  groundwork  is  set  through  a  discussion  of  the  relationship 
between  the  binary  representation  of  numbers  and  truncation  or 
rounding,  and  a  formulation  of  a  statistical  model  for  arithmetic 
roundoff.  The  analyses  presented  here  are  intended  to  illustrate 
techniques  of  working  with  particular  models.  Results  of  previous 
work  are  discussed  and  summarized  when  appropriate.  Some  ex¬ 
amples  are  presented  to  indicate  how  the  results  developed  for 
simple  digital  filters  and  the  FFT  can  be  applied  to  the  analysis  of 
more  complicated  systems  which  use  these  algorithms  as  building 
blocks. 

I.  Introduction 

IN  PRACTICE,  digital  signal  processing  requires  the  rep¬ 
resentation  of  sequence  values  in  a  binary  formal  with  a 
finite  register  length.  The  effect  of  the  finite  word-length 
constraint  manifests  itself  in  several  different  ways.  If  a  se¬ 
quence  to  be  processed  is  derived  by  sampling  an  analog 
waveform,  then  the  finite  word-length  constraint  requires 
that  the  analog-to-digital  conversion  produce  only  a  finite 
number  of  values.  This  represents  quantization  of  the  input 
waveform.  Even  when  we  start  with  data  representable  with 
a  finite  word  length,  the  result  of  processing  will  naturally 
lead  to  values  requiring  additional  bits  for  their  representa¬ 
tion.  For  example,  a  6-bit  data  sample  multiplied  by  a  6-bit 
coefficient  results  in  a  product  which  is  26  bits  long.  If  in  a  re¬ 
cursive  digital  filter  we  do  not  quantize  the  result  of  arith¬ 
metic  operations,  the  number  of  bits  required  will  increase 
indefinitely,  since  after  the  first  iteration  26  bits  are  required, 
after  the  second  iteration  36  bits  are  required,  etc.  The  effect 
of  quantization  in  such  a  context  depends  on  such  factors  as 
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whether  we  are  considering  fixed  point  or  flouting  point  arith¬ 
metic,  and  whether  for  fixed  point  arithmetic  wt  are  using  a 
representation  of  numbers  in  terms  of  fractions  or  integers, 
or  perhaps  a  mixture  We  will  be  treating  the  ease  of  fixed- 
point  arithmetic  and  floating-point  arithmetic  separately. 
For  fixed-point  arithmetic,  it  is  natural  in  ..  signal  processing 
context  to  consider  a  register  as  representing  a  fixed-point 
fraction.  In  this  way  the  product  of  two  numbers  remains  a 
fraction  and  the  limited  register  length  can  be  maintained  by 
truncating  or  rounding  the  least  significant  bits.  With  this 
t\q>e  of  representation  the  result  of  addition  on  fixed-point 
fractions  need  not  be  truncated  or  rounded  but  it  can  increase 
in  magnitude  so  that  the  sum  eventually  is  not  a  fraction. 
This  effect  is  commonly  referred  to  as  overflow,  and  can  be 
handled  by  requiring  that  the  input  data  be  sufficiently  small 
so  that  the  possibility  of  overflow  is  avoided.  In  considering 
floating-point  arithmetic,  dynamic  range  considerations  gen¬ 
erally  can  be  neglected  due  to  the  large  range  of  representable 
numbers  but  quantization  is  introduced  both  for  multiplica¬ 
tion  and  for  addition. 

A  third  effect  of  finite  word  length  is  inaccuracies  in  pa¬ 
rameter  values.  While  generally  signal  processing  parameters 
are  initially  s|iecified  with  unlimited  accuracy,  they  can  only 
be  utilized  with  finite  word  length.  This  effect  is  similar  to  the 
effect  which  arises  in  implementing  analog  processing  using 
inaccurate  circuit  elements.  There  are  two  possible  approaches 
to  handling  the  inaccuracies  in  parameter  values.  One  possi¬ 
bility  is  to  develop  design  procedures  which  inherently  are 
insensitive  to  parameter  inaccuracies.  An  alternate  is  to  choose 
specifications  which  are  consistent  with  the  limited  register 
length.  There  is  a  certain  amount  that  is  understood  about 
the  effect  of  inaccuracies  in  parameter  values,  but  for  the 
most  part  present  results  lead  to  guidelines  rather  than  hard 
design  or  analytical  strategics. 

In  the  following  discuss, on  the  relationship  between  the 
binary  representation  of  „u  others  and  truncation  or  rounding 
is  discussed  and  a  statistical  model  for  arithmetic  roundoff  is 
presented.  This  statistical  model  is  then  applied  to  the  anal¬ 
ysis  of  fixed-point  and  floating-point  rounding  errors  in 
digital  filters.  The  analysis  includes  a  consideration  of  the 
effect  of  dynamic  range  in  developing  and  comparing  signal- 
to-noise  ratios  for  fixed-point  and  floating-point  filters.  It 
is  not  always  possible  to  treat  the  effects  of  arithmetic  round¬ 
off  in  terms  of  a  s-'mple  statistical  model.  Some  approaches 
and  results  are  available  in  the  literature  on  the  limit  cycle 
behavior  of  digital  filters  due  to  arithmetic  roundoff,  and  a 
discussion  of  soirv  of  these  results  is  included. 

For  the  analysis  of  arithmetic  roundoff  in  computation  of 
the  discrete  Fourier  transform  using  the  fast  Fourier  trans- 
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form  (FIT)  algorithm  a  statistical  model  is  used.  With  this 
model  the  signal  to-noise  ratio  is  developed  and  compared 
for  fixed  point  and  floating-point  arithmetic. 

While  for  any  given  filter  configuration  or  spectral  anal- 
vsis  problem  it  can  be  difficult  to  carry  out  a  detailed  anal¬ 
ysis  of  the  effects  of  finite  register  length  there  are  a  number 
of  general  guidelines  that  can  be  distilled  from  the  results  pro 
sented  here  In  Section  IV  some  examples  and  guidelines  are 
presented  for  filters  implemented  with  fixed  point  arithmetic 
and  with  floating  point  arithmetic  as  well  as  for  filters  im¬ 
plemented  with  the  FFT. 

I  his  paper  is  intended  primarily  as  a  tutorial  review  of  a 
subject  which  has  received  considerable  attention  over  the 
past  few  years.  I  he  analyses  which  are  presented  here  are 
selected  to  illustrate  techniques  of  working  with  particular 
models.  Previous  work  is  freely  referenced,  discussed,  and 
borrowed  front. 

II.  Number  Representation  and  Its 
Iutect  ON  Quantization 

A.  Fixed-Point  and  Floating-Point  Numbers 

The  manner  in  which  finite  word  length  effects  are  mani¬ 
fested  is  closely  tied  to  the  way  in  which  numbers  arc  repre¬ 
sented. 

Digital  computers  and  special  purpose  digital  hardware 
for  the  most  part  use  a  number  representation  with  a  radix  of 
2.  i.e.,  a  binary  representation.  Therefore,  a  number  is  repre¬ 
sented  by  a  sequence  of  binary  digits  which  are  cither  *e-  . 
or  unity.  Just  as  a  decimal  number  is  represented  as  a  string 
of  decimal  digits  with  a  decimal  point  dividing  the  integer 
part  from  the  fractional  part,  the  sequence  of  binary  digits 
is  divided  by  a  binary  point  into  those  representing  the  in¬ 
teger  part  of  the  number  and  those  representing  the  frac¬ 
tional  part.  I  has  if  A  denovs  the  location  of  the  binary  point, 
the  binary  number  1001^01 10  has  the  decimal  value  of  (1  X23 

fOX22  +  ()X2'-f  1  X20)-f (0X2  '  +  1  X2  2+  1  X2  3  +  ()X2  <). 

I  his  representation  always  corresponds  to  a  positive  number. 

I  he  manner  in  which  arithmetic  is  implemented  in  a 
digital  computer  or  in  a  special  purpose  hardware  depends  on 
where  in  the  register  the  binary  point  is  located.  For  fixed- 
point  arithmetic,  the  implementation  is  based  on  the  assump¬ 
tion  that  the  location  of  the  binary  point  is  fixed.  The  manner 
in  which  addition  is  carried  out  will  not  depend  on  the  location 
of  the  binary  point  for  fixed-point  arithmetic  as  long  as  it  is 
the  same  for  every  register.  For  multiplication,  however,  the 
location  of  the  binary  point  must  be  known.  For  example, 
consider  the  pioduct  of  the  two  4-bit  numbers  1001  ^  ami 
001  h  In  general,  of  course,  the  product  of  two  6-bit  numbers 
"ill  Ih‘  2b  l,its  The  8- bit  product  of  the  above  number  is 
0001 101 1  j.  If,  on  the  other  hand,  we  consider  the  4-bit  frac¬ 
tions  a  1001  and  jOOll,  then  the  8-bit  product  is  aOOOIIOII 
In  digital  filtering  applications,  it  is  usually  necessary  to  ap¬ 
proximate  the  26-bit  product  of  two  6-bit  numbers  by  a  6- 
b't  result.  In  integer  arithmetic  this  is  difficult.  With  frac¬ 
tional  arithmetic,  on  the  other  hand,  this  can  be  accomplished 
b\  truncating  or  rounding  to  the  most  significant  6  bits.  For 
multiplication  with  fractions,  overflow  can  never  occur  since 
’he  product  of  two  fractions  is  a  fraction.  Thus  for  the  4-bit 
example  previously  mentioned,  the  product  aOOOIIOII  can  be 
approximated  by  a<>001  (truncation)  or  a<)010  (rounding). 

An  alternative  to  fixed-point  arithmetic  is  a  floating-point 
representation.  In  thiscase,  a  positive  number  /•  is  represented 
as  F=  2  .! /,  where  1/,  the  mantissa,  is  a  fraction  between  1  2 
and  1,  and  c,  the  characteristic,  can  be  either  positive  or 


negative.  The  product  of  two  floating-point  numbers  is 
carried  out  by  multiplying  the  mantissa  as  fixed  point  frac¬ 
tions  and  adding  the  characteristics.  Since  the  product  of  the 
mantissas  will  be  between  1/4  and  1,  ;l  normalization  of  the 
mantissa  and  corresponding  adjustment  of  the  characteristic 
mav  be  necessary.  The  sum  of  two  floating  point  numbers  is 
carried  out  by  scaling  the  mantissas  of  the  smaller  number  to 
the  right  until  the  characteristics  of  the  two  numbers  are 
equal  and  then  adding  the  mantissas.  For  example,  consider 
the  sum  of  F,  and  Ft  with  Ft  =  4  and  Ft  =  5  4  I  hen  in  floating 
pin n t  notation,  F,  =  2r| .1/,.  and  F!-2rM/J  with 

fi  =  1  1a  (  =  .!  decimal) 

M\  =  a  1»(H)  (  =  0.5  decimal) 

c5  =  Ia  (=1  decimal) 

A/:=  a1010  (  =  5/8  decimal). 

Ill  order  to  carry  out  the  addition,  r,  must  be  changed  to  equal 
Ci  and  .!/,  must  be  adjusted  accordingly.  Thus  first  the  repre¬ 
sentation  of  Ft  is  changed  to  F2  =  2fM/2  with 

It  =  Ha 
Alt  =  aOOIOI 

in  which  case  the  mantissas  can  now  be  added.  The  resulting 
sum  is  F=T.\1  with  r=  11a  and  .1/  =  aIOIOI.  In  this  case  the 
sum  of  .1/)  and  -l/2isa  fraction  between  1/2  and  1  and  therefore 
no  further  adjustment  of  c  has  to  be  carried  out.  In  a  more 
general  case,  the  sum  may  not  be  in  that  range,  and  conse¬ 
quently.  c  would  be  adjusted  to  bring  the  mantissa  into  the 
proper  range,  '-'rom  this  example  it  should  be  clear  that  in 
general  with  floating-point  arithmetic,  the  mantissa  can  ex¬ 
ceed  the  register  length  and  must  therefore  be  truncated  or 
rounded  for  both  addition  and  multiplication  whereas  this  is 
only  necessary  for  multiplication  in  the  fixed -point  case.  On 
the  other  hand,  if  the  result  of  addition  in  the  fixed  point 
ca.c  exceeds  the  register  length,  truncation  or  rounding 
wni  not  help,  i.e..  the  dynamic  range  has  been  exceeded  Thus 
"  I'ile  floating  point  introduces  error  due  to  arithmetic  round¬ 
off,  it  provides  much  greater  dynamic  range  than  fixed  point. 

W  we  will  see  later,  both  of  these  effects  must  be  considered 
when  comparing  fixed-point  and  floating-point  realizations 
of  digital  filters. 

B.  Representation  nj  Segatlve  Numbers 

I  here  are  three  common  means  used  for  representing 
fixed-point  negative  numbers.  The  first,  and  most  familiar,  is 
sign  and  magnitude,  i.e.,  the  magnitude  (which  is  of  course 
Positive)  is  represented  as  a  binary  number  and  the  sign  is 
represented  by  the  leading  binary  digit  which,  if  0  corresponds 
to  a  +  and  if  1  corresponds  to  a  -  („r  vice  versa).  Thus  for 
example,  in  sign  and  magnitude  OaOOII  represents  A  10  and 
I  a001  1  represents  —  A  16.  Two  other  related  representations 
of  negative  numbers  are  often  referred  to  as  one’s-comple- 
meut  and  txvo's-romplenicnt  representations.  Considering  all 
numbers  to  be  fractions,  a  positive  number  is  represented  as 
before.  For  two's  complement  representation  a  negative  mini 
ber  is  represented  by  2.0  minus  its  magnitude.  For  example 
-(OaOIIO)  in  sign  and  magnitude  is  represented  as  1^  1010 
in  tvvo's-com piemen t  since  10a00()-0a0100=  1  a  1 0 1 0.  For 
one's-com piemen t,  the  negative  number  is  represented  by- 
subtracting  the  magnitude  from  the  largest  number  repre¬ 
sentable  in  the  register.  Thus  -(OaOIIO)  is  represented  bv 
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1a!  Ill)  —  (0^0110)  =  ljlOOl,  One's  complement  representa¬ 
tion  is  equivalent  to  representing  a  negative  . . her  by  the 

hit  In  hit  complement  of  its  magnitude.  The  choice  of  repre¬ 
sentation  for  negative  numhers  in  a  particular  system  is 
usually  ha  set)  almost  entirely  on  hardware  considerations, 

I' °r  t,u>  representation  of  negative  lloa ting- point  numhers 
there  are  a  variety  of  conventions  that  have  heen  used.  In 
this  paper  we  will  consider  the  sign  of  the  number  to  he  asso¬ 
ciated  with  the  mantissa  so  that  the  mantissa  is  a  signed 
fraction  The  representation  of  this  signed  fraction  can  of 
course  he  in  sign  and  magnitude,  one's-complement  of  two's- 
complement  notation. 

C.  A  Model  for  Arithmetic  Roundoff 

In  formulating  a  model  for  arithmetic  roundoff,  we  shall 
consider  both  fixed-point  numhers  and  mantissas  of  lloating- 
point  numhers  to  he  represented  as  * 4- 1  -hit  binary  fractions, 
with  the  binary  point  just  to  the  right  of  the  highest  order  hit 
(or  sign  hit).  This  convention  represents  no  loss  of  generality, 
and  its  convenience  has  heen  alluded  to  above.  The  numerical 
value  (for  positive  numbers)  of  a  one  in  the  least  significant 
hit  is  2~\  and  this  quantity  can  he  referred  to  as  the  width  of 
quantization. 

As  indicated  previously,  the  effect  of  finite  register  length 
on  the  result  of  arithmetic  operations  depends  on  whether 
fixed-point  or  floating-point  arithmetic  is  used,  and  ho-v  nega¬ 
tive  numhers  are  represented.  Let  us  consider  first  the  effect 
of  truncation  and  rounding  in  the  fixed-point  case.  For  sign 
and  magnitude,  one's-complement  and  two’s-completnent, 
the  representation  of  positive  numhers  is  identical  and,  cor 
sequen  ly.  so  is  the  effect  of  truncation  and  rounding.  If  ET 
denote;  tl  e  crw.r  due  to  truncation,  i.e.,  the  value  after  trun¬ 
cation  nanus  the  value  before  truncation,  this  error  will  al¬ 
ways  he  negative  for  positive  numhers.  That  is,  the  effect  of 
truncation  is  to  reduce  the  value  of  the  numhers.  More 
specifically,  if  ft.,  denotes  the  number  of  hits  (exclusive  of  sign) 
after  truncation,  and  b,  denotes  the  number  of  bits  before 
truncation,  then  the  result  satisfies  t)  >Ht>  --(2~h,  —  2~b'). 

With  sign  and  magnitude  representation  of  negative  num¬ 
hers,  truncation  reduces  the  magnitude  of  the  number  and  the 
error  liT  satisfies  0  <liT  <  (2~*»-  2~»).  For  a  twoWomple- 
ment  negative  number  represented  by  the  hit  string  lj,  a,, 

•  •  •  ,  (it,,,  the  magnitude  is  given  by 

Mx  =  2.0-  x, 

where 

-Vi  =  1  + 

1 

Truncation  to  ft2  bits  (ft2<fti)  produces  the  hit  string  1^,  iix. 
<i-i,  ■  ■  ■  ,  (tkj,  where  now  the  magnitude  is 

M,  =  2.0  -  ,v2 

\vi  t  h 

62 

=  1  4-  yi 

I- 1 

The  change  in  magnitude  is 

AM  =  Mz  —  M,  =  £  a,2-i 


^  - 1  ruination 

-»*•  (£’»  compltmtnt)  (l’«  compl»m«n»  «nd  »ijn 

j  j  .  j  1  Oi0|i|-i  >2*1  ond  mognltudt 

OiO(D-i>2-‘  ;  1  >0 
0  S  ,  m<o 

log,  1.  Transfer  characteristics  for  rounding  and  truncation. 

and  it  is  easily  seen  that 

0  <  AM  <  2-'»  -  2~h\ 

Hence  the  effect  of  truncation  for  two’s-complement  negative 
numbers  is  to  increase  the  magnitude  of  the  negative  number; 
the  truncation  error  is  negative,  and  satisfies  0>Er>—  (2_6* 
—  2-*').  ~ 

For  a  one's-complement  negative  number  represented  by 
the  bit  string  1^,  ax,  «2,  •  •  •  ,  up,,  the  magnitude  is  giv  en  by 

M,  =  2.0  -  2 -*»  -  .r, 

and  tnmeation  to  b?  bits  yields  a  magnitude 

M 5  =  2.0  -  2“‘»  -  ,vs 

"  I'vrc  .v,  and  ,v2  are  as  defined  above.  The  change  in  magnitude 
is 

t>i 

AM  =  M-,  —  .1/,  =  (Ji2  —  (2-*«  —  2“6') 

I  ! 

and  now 

0  >  SM  >  -  (2  1,1  -  2  bt). 

Hence  the  effect  of  truncation  for  one’s  complement  negative 
numbers  is  to  decrease  the  magnitude  of  the  negative  mini 
her;  the  truncation  errot  is  positive,  anil  satisfies  0</'7-<2“-2 
-2  b' . 

I  he  effect  of  rounding,  of  course,  will  be  the  same  inde- 
IK-ndent  of  how  negative  numbers  are  represented  and  the 
rounding  error  will  always  be  greater  than  or  equal  to 
(-1  2)2  *  and  less  than  or  equal  to  (+1  2)2  h.  The  effect  of 
truncation  and  rounding  for  the  fixed-point  case  is  summarized 
m  Fig.  ,  where  .v  represents  the  value  before  truncation  or 
rounding  and  ()(.x)  represents  the  value  after.  I11  the  figure  it 
is  assumed  that  ,v  can  take  on  a  continuous  range  of  values, 
correspom'ie _  to  ft  1=  *  in  the  discussion  above,  and  that  the 
quantized  word  length  is  ft  hits  plus  sign. 

For  the  case  of  lloating-point  arithmetic,  the  effect  of 
truncation  or  rounding  is  reflected  only  in  the  mantissa.  It  is 
convenient  in  the  lloating-point  case  to  describe  the  error  in  a 
multiplicative  sense  rather  than  in  an  additive  sense  as  is 
done  in  fixed-point  arithmetic.  I11  other  words,  for  a  lloating- 
point  word,  if  .v  represents  the  value  before  truncation  or 
rounding  and  Q(.x)  represents  the  value  after,  then  we  express 
(K.v)  as  equal  to.v(l+t).  For  the  case  of  rounding,  for  example, 
the  error  in  Hie  mantissa  is  between  +2"6/2,  and  conse¬ 
quently  the  error  in  the  value  of  the  floating-point  word  is 

2  b  2  b 

— —  <  Q(x)  -  ,T  <  2' - 

2  2 
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p(E.) 


(b) 


/a  I’robnbililj- density  Hindi, m  lor  . . ding  n, (|,)  l>i,,h;i 

Inlity  density  function  for  now-  dm-  to  IwoVcotnpleniem  triinealioti. 

or,  since  Q(x)-x=tx 

2  b  i  b 

—  2' •  —  <  tx  <  2'-  - 
2  ~  ~  i 

nnd  since  T  1  <.v<2f,  we  can  write  that  for  the  use  of  'omul 
h>H  “2  b<t<2 b.  In  a  similar  manner  we  can  show  that  for 
one's-coinplenient  ami  for  sign  and  magnitude  truncation 
— —  2  2  b.  l-or  two's-contplement  truncation 

0  >  f  >  _  2-2  \  j>  0 

<><e<2-2\  ,•<(). 


addition.  For  fixed  point  arithmetic,  ronndolT  is  introduced 
only  after  the  multiplication.  Uecanse  of  the  possibility  of 
overflow  due  to  addition,  there  is  a  dynamic  range  limitation 
in  lived -point  filters.  In  contrast,  floating-point  filter  imple¬ 
mentation  has  a  much  less  severe  dynamic  range  constraint, 
although  arithmetic  roundoff  is  introduced  due  to  both  multi 
plication  and  addition.  In  the  next  sections  we  will  first  de¬ 
velop  the  statistical  analysis  of  arithmetic  roundoff  for  fixed- 
pi  mt  Idlers  including  dynamic  ran-  considerations.  This  is 
h  Mowed  by  a  statistical  analysis  for  lloating-point  arithmetic 
and  a  discussion  of  zero  input  limit  cycle  behavior  for  fixed 
point  arithmetic. 

Statistical  Analysis  of  Fixed -Point  Errors  in  a  Digital 
hitter  I-/].  1 5] 

la  many  situations  it  is  reasonable  to  model  the  effect  of 
rounding  in  a  digital  filler  by  a  simple  statistical  model  I  he 
approach  is  to  model  the  effect  of  the  rounding  at  each  multi¬ 
plier  by  a  white-noise  source  uniformly  distributed  in  ampli 
Hide  between  plus  and  minus  (1  2)2  b.  Kacli  of  the  noise 
sources  is  assumed  to  be  linearly  independent  of  each  other 
and  of  the  input.  Fxperimentally  these  assumptions  have 
been  justified  for  a  broad  class  of  inputs  including  random 
signals,  speech,  etc.  The  model  is  clearly  not  valid  for  certain 
inputs,  such  as  constant  inputs.  If  the  impulse  response  from 
the  A'tli  noise  source  to  the  output  is  lik(n)  then  the  steady- 
state  output  noise  \  ariance  due  to  the  *th  noise  source  is 


l>.  .Statistical  Model  of  Arithmetic  Roundoff 

A  convenient  means  for  analyzing  the  elTeet  of  quantiza¬ 
tion  is  to  represent  the  error  statistically  j  1  ].  [ 2 J .  In  par¬ 
ticular.  for  the  case  of  fixed-point  arithmetic  and  rounding  Et 
is  represented  as  a  random  variable  with  a  probability  density 
shown  in  Fig.  2  a).  For  the  case  of  t wo’s-complenient  trunca¬ 
tion,  the  probability  density  is  shown  in  Fig.  2(b). 

In  each  of  these  cases,  the  assumption  is  that  the  random 
variable  Et  is  independent  of  .v.  For  one's  complement  and 
sign  magnitude  truncation,  this  assumption  cannot  lie  made 
since  the  mean  value  of  the  error  is  directly  correlated  with 
t  he  sign  of  v.  I  n  the  analysis  that  follows  for  fixed -point  arith¬ 
metic,  the  discussion  is  phrased  in  terms  of  rounding.  The  re¬ 
sults  are  easily  modified  for  twu’s-complenient  truncation.  In 
particular  the  variance  of  the  noise  is  identical  for  both  cases. 
However,  for  rounding  the  noise  is  zero  mean  and  for  two’s- 
complement  truncation  it  is  not  zero  mean. 

F’or  the  floating-point  ease,  the  parameter  t  is  considered 
to  be  a  random  variable  which  is  independent  of  y.  In  that 
case  the  assumption  of  independence  is  reasonable  for  round 
mg,  sign  and  magnitude  truncation,  and  one’s-complement 
truncation,  but  not  for  two’s-complement  truncation.  The 
random  variable  t  is  bounded  by  -2~6<t<2  b.  We  will  gen¬ 
erally  assume  t  to  be  uniformly  distributed  in  this  range  with  a 
variance  <r«*=  (1/3)2  *  Fmpirical  work  has  shown  that  the 
distribution  is  not  quite  uniform  so  that  while  <7, 2  is  propor¬ 
tional  to  2  26,  the  constant  of  proportionality  is  slightly  less 
than  13.  However,  the  interpretation  of  the  results  depends 
primarily  >n  the  proportionality  to  2  2\ 

HI  Finite  Kec.istkk  Length  Litects 
for  Digital  Filters  [3] 

A.  Introduction 

The  basic  arithmetic  operations  involved  in  implementa¬ 
tion  of  n  digital  filter  are  multiplication  by  a  constant  and 


. <k  -  cr.-  ^  /i,-(H)  (1) 

w-0 

where  *,*«(!  12)2  *  Since  all  the  noise  sources  are  as¬ 
sumed  to  be  uncorrelated,  the  total  output  noise  is 

a."  =  £  <r (2) 

k 

For  example,  if  we  consider  the  lirsl-order  filter  in  l  ig.  3  one 
noise  source  is  introduced.  In  this  case,  the  impulse  response 
from  the  noise  source  input  to  the  output  is  Inn)  =  .i"/,  p n) 
where  u  ,  denotes;!  unit  stop  sequence,  so  that 


12  l-„  w 

l  or  a  second order  filler  with  o  ie  complex  pole  pair  there  are 
two  noise  sources  as  indicated  ii  Fig.  4.  The  resulting  output 
noise  is 


/  1  +  r-  1  \ 

2  »( -  \ 

\1  -  r2  r4+  1  —  2r-  cos  20  ) 


C  Dynamic  Raiij’e  Considerations  for  Fixed-Point  Eilters 

As  indicated  previously,  the  possibility  of  overflow  must 
be  considered  in  the  implementation  of  digital  filters  with 
fixed-point  arithmetic.  With  the  convention  that  each  fixed- 
point  register  represents  a  signed  fraction,  each  node  in  the 
filter  must  be  constrained  to  maintain  a  magnitude  less  than 
unity  in  order  to  avoid  overflow.  Letting  .v(«)  denote  the  filter 
input  and  y *(»)  and  M»)  denote  the  output  and  unit  sample 
response  for  the  k tli  node  in  the  filter,  then 


v*(w)  =  ^  lh(r).r(n  —  r).  i 

r~0 

If  -Vnmx  denotes  the  maximum  of  the  absolute  value  of  the 
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a 


Ki'k-  3.  Noisy  first -onlcr  filter  (fixed  ix>int). 


Ktg.  4.  Noisy  second-order  filter  (fixed  point). 


put  then 

aC 

I  }'a(h )  <  .v„,„  Y  ht(.r)  •  (6) 

r-0 

Thus,  since  we  require  that  [  v*(«)|  <1  (6)  requires  that 

<  1  /  Y  h*(r)  .  for  all  k.  (7) 

'  r  I) 

Equation  (7)  thus  provides  an  upper  hound  on  the  maximum 
value  of  the  input  to  insure  that  no  overflow  occurs  in  the  Hh 
node.  For  a  general  input  (7)  in  fact  provides  a  least  upper 
hound,  i.e.,  if  the  maximum  value  of  the  input  exceeds  toe 
hound,  overflow  can  occur.  This  is  a  consequence  of  the  fact 
that  equality  can  he  achieved  in  (6)  with  a  sequence  v(»)  for 
which  at  u  —  n„,  .v(«„  —  r)  =  [sgn  /t*(r)]  for  r  =  ()  to  x .  (Where 
sgn  (,v)  =  1  for  x  >0  and  sgn  (,v)  =  —  1  for  x  <0.)  Thus  in  the 
most  general  case,  (7)  is  required  to  guarantee  that  no  over¬ 
flow  occurs.  The  condition  in  (7)  would  generally  he  satisfied 
by  applying  attenuation  to  the  signal  at  the  filter  input. 

If  we  assume,  for  example,  that  the  input  x(n)  is  a  white- 
noise  sequence  with  a  uniform  amplitude  distribution,  we 
would  choose  for  the  case  of  the  first-order  filter  a  maximal 
input  amplitude  of  (1— «).  For  this  case,  if  a,2  denotes  the 
variance  of  the  input  signal,  and  <ru 2  denotes  the  variance  of 
the  output  signal,  then 


For  this  example,  we  can  then  compute  a  iwisc-to-signal  ratio 
as  the  ratio  cr0  1  ’a2  wi  th  the  result 


On" 


1  1 

2  2>>  _ _ _ 

4  (i  -  oy- 


(9) 


In  a  similar  manner  we  can  derive  a  noise-to-signa!  rati  >  for 
the  second-order  filter  shown  in  Fig.  4.  As  in  the  first-order 
case,  we  restrict  the  maximum  input  in  order  to  guarantee 


that  the  dynamic  range  of  the  registers  is  not  exceeded  If  we 
consider  the  input  sequence  to  be  uniformly  distributed  white 
noise,  the  resulting  output  noise-to-signal  ratio  will  he 

=  2~2i(-~ —  Y  r"  I  sin  ((«  +  1)«J  |  )  ■  (10) 

2  Vstn  0  „  0  / 

While  it  is  difficult  to  evaluate  this  expression  exactly,  it  is 
possible  to  obtain  an  upper  and  lower  bound  Since  51", o  b„  is 
the  largest  possible  output  obtainable  with  an  input  that 
never  exceeds  unity,  it  must  be  larger  than  the  response  of 
the  second-order  filter  to  a  sinusoid  of  unity  amplitude  at  the 
resonant  frequency  With  this  consideration,  we  can  write  that 


1/(1  -  r)2(l  +  r2  -  2rcos20)  (11) 


since  the  i  ight-hand  side  of  this  inequality  is  the  gain  at  reso¬ 
nance.  Furthermore, 

(-1-  £  !  sin  [(„  +  1)0] !  V  <  (-4-T  ±  r»V.  (12) 

Vstn  0  ,,„o  /  Vstn  0  „„0  / 

Theiefore,  for  the  second-order  case 

1  1 

2  ** 

2  (1  -  r)"-(l  +  r2-  2rcos2fl) 

(To2  1  1 

<  —  <  2”26  -  •  (13) 

“  cy-  ~  2  sin2 0(1  -  r)2 

F'or  both  the  first-  and  second-order  filter  an  expression  for 
the  noise-to-signal  ratio  ran  he  obtained  which  prov  ides  some 
insight  into  the  behavior  of  the  noise-to-signal  ratio  as  the 
poles  approach  the  unit  circle.  For  the  first-order  filter  let 
5=1  —  a  so  that  as  6— >0,  the  pole  approaches  the  unit  circle. 
Then  in  terms  of  6,  the  noise-to-signal  ratio  for  the  first-order 
filter  is 

c2  1  1 

-  =  -  2~2*  —  •  (14) 

■i  i  et  ’ 

<7y“  *♦  0“ 

For  the  second-order  filter,  let  5=  1  —  r  so  that,  again,  as  5— >0 

the  poles  approach  the  unit  circle.  Then  if  we  assume  that 
5  <K  1 ,  we  can  approximate  (1  +r2—  2r  cos  26)  its 

(1  +  r2  —  2r  cos  26)  =  4  sin2  6  +  52  (15) 


which  for  4  sin2  9  large  compared  with  52  we  will  approximate 
as  4  sin2  6.  Consequently,  incorporating  this  approximation, 


1  <r„2  1  1 

- <  _  <  -  2~2b  — 

452  sin20  <r„2  2  52  stn2  0 


(16) 


Thus  we  observe  that  the  noise-to-signal  ratio  as  considered 
thus  far  can  be  considered  to  be  proportional  to  2  2h  52  We 
note  from  this  dependence  that  if  5  is  halved,  then  to  maintain 
the  same  noise-to-signal  ratio  b  must  he  increased  by  1,  i  e., 
one  bit  must  be  added  to  the  register  length.  This  dependence 
provides  a  convenient  basis  for  comparison  of  different  over¬ 
flow  strategies  and  different  kinds  of  arithmetic. 

In  the  above  analysis,  the  filter  input  was  assumed  to  be 
uniformly  distributed  white  noise.  As  5  approaches  zero  the 
frequency  response  of  both  t lie  first-  and  second-order  filter 
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Incomes  more  selective  so  that  more  and  more  of  the  input 
energy  is  out  of  hand.  An  alternative  ha-i s  for  determining 
the  noise-to-signal  ratio  is  for  an  input  which  is  sinusoidal. 
For  this  choice  of  inputs,  of  course,  we  would  not  use  tin- 
general  condition  of  (7)  to  avoid  overflow  since  we  can  deter¬ 
mine  exactly  the  maximum  allowable  input  amplitude  as  a 
function  of  the  filter  parameters. 

In  particular,  if  the  input  is  of  the  form  v(  »)=*,„„  cos  »q> 
then  the  steady-state  output  is  of  the  form  y(«)=y„,.x  cos 
(#+*)•  To  prevent  overflow,  y„,„  must  he  less  than  unity 
and  to  maximize  the  output  signal  energy,  y,„„  is  chosen  to  lit¬ 
as  large  as  possible.  Thus  the  maximum  noise-to-signal  ratio 
is  obtained  when  is  chosen  so  that  y(n)  =cos 
Note  that  in  order  to  choose  .v„,.x  in  this  way,  the  frequency  of 
the  input  signal  must  he  known.  For  an  input  sinusoid  of  un¬ 
known  frequency  .v„„x  must  he  attenuated  so  that  overflow 
will  not  occur  even  in  the  worst  case,  where  the  frequency  of 
the  input  coincides  with  the  peak  gain  in  the  filter’s  transfer 
function. 

For  fixed-point  f. Iters,  within  the  validity  of  the  statistical 
model  for  roundoff  error,  the  output  noise  is  independent  of 
the  form  and  amplitude  of  the  input  signal.  Thus  for  this 
choice  of  inputs,  the  noise-to-signal  ratio  obtained  for  the 
first-order  filter  is 


°o-  1  1 

_ = _ 2~-b _ 

<v  24  1  -  a- 

If,  as  before,  we  let  u  =  1  — 5,  then  for  5«1 


(17) 


a  o' 

9 


1  2~-b 
48  5 


Thus  in  this  case,  the  noise-to-signal  ratio  is  proportional  to 
15  rather  than  1/5*  so  that  if  5  is  multiplied  by  1/4  and  the 
register  length  is  increased  by  one  hit,  the  noise-to-signal 
ratio  will  remain  constant.  We  can  consider  the  second- 
order  case  in  a  similar  manner.  Again  for  a  sinusoidal  input, 
the  output  with  maximum  amplitude  has  the  form  y(u)  = 
cos  (>i<t>+\p)  so  that  the  noise-to-signal  ratio  in  this  case  is 


= _ 2~2* 

12 


/I  +  r2 

\l  -T2  1 


1 


+  r*  —  2 r-  cos  26 


)■ 


Again,  choosing  r  —  1  —  5  with  5«1 


0V 


2~56 

45  sin2  0 


(19) 


(20) 


so  that,  as  with  the  first-order  filter,  the  noise-to-signal  ratio 
is  proportional  to  1  5  rather  than  1  52.  The  comparison  in  the 
noise-to-signal  ratio  for  a  white-noise  input  and  a  sinusoidal 
input  serves  to  illustrate  the  dependence  of  the  effect  of  dy¬ 
namic  range  considerations  on  the  particular  form  of  the  in¬ 
put.  In  some  sense,  the  two  cases  considered  represent  ex¬ 
tremes.  As  the  input  becomes  more  confined  to  a  known  nar¬ 
row  hand  of  frequencies  the  above  analysis  with  a  sinusoidal 
input  would  be  more  representative,  and  as  the  input  becomes 
more  wide-hand  the  above  analysis  with  a  white-noise  input 
is  more  representative. 

In  the  above  discussion,  the  noise-to-signal  ratio  for  tile 
<  as;  of  white-noise  input  was  derived  on  the  basis  that  over¬ 
flow  must  lie  avoided.  In  a  practical  case,  a  scaling  of  the  in¬ 
put  on  the  basis  of  (7)  can  lie  considered  to  he  somewhat  jicssi- 
mistic  sinre  the  probability  of  equality  being  attained  in  (7) 


is  extremely  small.  Furthermore,  for  many  filters  it  is  difficult 
to  compute  the  sum  in  (7).  Jackson  |7]  has  formulated  the 
dynamic  range  constraints  on  fixed-point  digital  filters  in 
terms  of  /.,,  norms.  In  particular,  let  F(oi),  X(u ),  and  //(oi) 
denote  the  Fourier  transforms  of  the  filter  output,  input,  and 
system  impulse  response,  respectively.  Then  it  can  he  shown 
in  general  that 

,  y(«)  <  H  „  -V 


i  1  P  +  1/v  =  1 


(21) 


w  here  11  ,,  and  A  q  are  the  Lp  norm  and  /.,,  norm  of  // (oi) 
and  A" (w),  rested vely,  where  these  norms  are  defined  as 

NIp=[^J  | //(«) 


and 


ll-Y||,  =  [^/F|  -Y(«)N“]1  '■ 

hor  example,  with  //(to)  chosen  as  unity,  a  consequence  of 
(21)  is  that 


U(«)l  <  -v 


all  q  >  1. 


As  another  consequence,  if  we  choose  p  =  1,  </=  x,  and  use 
the  fact  that  the  Lx  norm  of  .Y(to)|  is  the  maximum  value 
of  |  A' (to)  j  then  we  obtain  the  statement  that 


y(«) !  <  max  [  A' (to)  j  ) 


1 


n 

2t r 


//(to)  </to. 


(18)  As  an  alternative,  with  p  =  2,q  =  2, 


~\  rT  “|i/s 

|y(»)|  <[-J  | //(to)  | 2  ./a,  I 


-  f'|.v 

12* 


-  1/2 

(to)  |2  l/tO 


\y(») 

M 


I  o  prevent  overflow  in  the  output  we  require  that 


<1  and  to  insure  this  from  (21)  we  will  require  that 
A'  ,<1.  Consequently,  the  input  must  be  scaled  in 


such  a  way  that 


•V||,  <  1/1,7/ ' 


(22) 


This  condition  is  somewhat  less  general  than  (7)  but  in  many 
cases  is  easier  to  apply.  According  to  (22)  with  />  =  2,  q  =  2, 
the  condition  is  in  terms  of  the  energy  i/  the  input  signal  and 
the  energy  in  the  system  impulse  response.  For  q=  1,  p=  x, 
(22)  provides  a  bound  in  terms  of  the  peak  value  of  the  mag¬ 
nitude  of  the  transfer  function,  which  is  perhaps  most  ap¬ 
propriate  for  a  sinusoidal  input. 

F’or  the  case  of  a  random  input  (21)  cannot  be  applied 
since  the  input  and  output  do  not  have  Fourier  transforms. 
In  this  case  the  corresponding  condition  is  phrased  in  terms  of 
0„.,(«)  tile  autocorrelation  function  of  the  output,  Av/a))  the 
power  density  spectrum  of  the  input,  and  //(to)  the  magnitude 
of  the  system  function.  In  particular,  the  inequality  corre¬ 
sponding  to  (21)  is 


<M«)  <  U'  P|  4V 


(23a) 


or  equivalently 


**»(«)  <  //  (23b) 

Since,  if  the  input  is  ,’ero  mean,  </>w(0)  it  follows  that 
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<r„  <  U  »„  1*„  „.  (24) 

1  P*rt,*'nl*r  cases  of  interest  are  />=!,  «/=  «  and  6  =  oc 
</=  Iso  that 

2  •» 

<  II  ;|  <I>„  „  (25) 


*  //i;:i  *« ..  (26) 

As  Jackson  points  out,  (25)  implies  the  most  stringent  cornli- 
tion  on  the  input  spectrum  «k„(to)  whereas  (26)  implies  the 
tnost  stringent  condition  on  the  transfer  function.  From  (25) 

"  tl,c  m'n,t  sPcctrum  is  white  so  that*„(«)  =  „,*  f„r  all  «.  then 

?  O  H 

*»'  -  u  '-  (27) 

with  the  input  secpience  C.aussian.  then,  the  output  will  over- 
n"w  "«>  more  often  than  the  input  overflows  if 

H  s  <  1.  (28) 

More  generally,  (27)  provides  a  basis  for  choosing  tile  input 
variance  to  control  the  maximum  percentage  of  time  that  the 
output  can  overflow. 

D.  Statistical  Analysis  of  Roundoff  Errors  with  Floatin'. Point 
Arithmetic 

For  the  case  of  floating-point  arithmetic,  noise  is  introduced 
due  both  to  the  adds  and  the  multiplies.  In  analyzing  the 
effect  of  floating-point  roundoff  the  effect  of  rounding  will  be 
represented  multiplicatively  so  that  if  [.v)  denotes  rounding  of 
the  mantissa  in  a  floating-point  number,  then 

M  =  •'■(!  +  f).  (29) 

r°  ,'lu8t™tc  the  analysis  of  roundoff  errors  with  floating-point 
arithmetic  let  us  consider  a  first-order  filter.  Let  v(n)  denote 
the  ideal  response  of  the  filter,  that  is.  the  response  with  no 
roundoff  no.se  and  let  y(n)  denote  the  response  of  the  filter  in 
t  ie  presence  of  roundoff  noise.  Then  following  l.iu  and  Kaneko 
lsJ  we  can  write  that 

!i'(it)  =  aw(n  —  1)  +  x(ii)  ( (p) 

v(w)  =  [<0'(«  —  1)(1  -f  {„)  -f  ,v(w)](l  +  t„).  (3|) 

We  assume  that  t„  and  £„  are  uniformly  distributed  between 

l  and  2  .  are  uncorrelated  from  iteration  to  iteration  are 

independent  of  each  other,  and  also  are  independent  of'  the 
signab  Letting  E(n)  represent  the  error  in  the  output,  so  that 
that  ^  Wecan  "'rilc  from  'he  above  two  et, nations 

!■-(")  -  a  I  An  -  1) 

t 

=  a-d'(n  -  1)({„  +  £„)  -f  .v(»;)t„  =  «(„)  (32)  f 

where  we  ha  c  neglected  second-order  terms  in  t,  $,  and  F 
Since c  and  £  are  statistically  independent  of  .v,  and  of «.(„  -  1)  r 
the  term  n(n)  is  easily  shown  to  be  a  white-noise  sequence  Its  )■ 
variance,  of  course,  depends  on  the  excitation  x(n).  The  y 
derivation  of  (32)  with  the  second-order  terms  neglected  cor¬ 
responds  to  representing  the  roundoff  noise  as  an  additive 
no.se  source  that  is  statistically  independent  of  the  signal  but 
whose  variance  depends  on  the  signal  variance.  Specifically, 
consider  the  first-order  network  drawn  in  Fig.  5  with  the  two 
noise  sources  e,(n)  and  e,(«).  From  the  model  for  multiplier 


in 

..  1 


f  ig*  5.  Noisy  firsl-ordor  filler  (Healing  point). 

roundoff  noise,  the  noise  source  et («)  is  given  by 

nOO  =  av(n  -  l)f„  (33) 

and  the  noise  source  e2(«)  is  given  by 

f:(»)  =  £(«)£»•  (34) 

I  he  analysis  above  in  which  we  neglected  second-order  terms 
corresponds  in  this  case  to  evaluating  the  v  aria  nee  of  e,(n)  and 
e-A")  by  using  the  mean-square  values  fo  •  y(»-1)  and  g(n) 
that  would  result  if  no  roundoff  noise  were  present.  Therefore, 
if  we  assume  that  x(n)  is  a  zero-mean  white-noise  input,  with 
variance  <t,\  then  the  variances  of  <•,(.;)  and  r«(n)  are,  respec- 
tivelv. 


#i*  =  a-o.-yHn  -1)  =  a- a  Ac,- - 

1  —  a 2 


~  f(n)  =  <rpor- 


1  -  a- 


where  the  bar  denotes  expected  value.  Then,  since  <-,(«)  and 
e,(")  are  independent,  because  «„  and  are  independent,  the 
output  noise  variance  is 

•>  </•  1  +  a2 

’•  (1  - 

where  we  have  assumed  again  that  a,2  and  ct;2  are  equal.  The 
output  noise-to-signal  ratio  is 

<r„s  1  +  a- 

-T  =  *■*  - - -  '  (37b) 

°u  i  —  a 

\\  e  can  analyze  the  effect  of  roundoff  noise  in  the  second-order 
filter  in  a  similar  manner.  In  Fig.  6  is  shown  the  network  for  a 
second-order  filter  with  roundoff  noise  sources  included.  Note 
that  since  noise  sources  must  be  included  due  to  addition,  two 
summers  are  included  to  add  the  three  variables  in  the  feed¬ 
back  loop.  The  noise  sources  e,(»)  and  et(n)  represent  the  noise 
due  to  the  multiplies  and  the  noise  sources  c,(«)  and  c2(u)  rep¬ 
resent  the  noise  due  to  the  additions.  With  assumptions  simi¬ 
lar  to  those  above  in  which  we  neglected  second-order  terms, 
we  write  that 

<’i(«)  =  v(w)tt(n) 

f2(«)  =  l v(»l)  -  x(h)]*,(||) 

r»(»)  =  2r  cos  6v(n  -  1)«„(«) 

f'(")  =  ~  r~y(ii  ~  2 )«,(»)  (38) 
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•ft*)  mfc) 


1ln'" 


T 

e4(n) 

I'll!-  6.  Noisy  second-order  filter  (floating  jMitni). 

wHeru  «,,  e2,  «a,  and  e.  are  independent  random  variables  with 
equal  variance  »,*.  If  as  before.  x(n)  is  assume,!  to  be  a  white 
random  process  with  variance  o,\  then  the  output  noise-to- 
siRnal  ratio  for  the  second-order  case  is 

v  =  <T<2  [l  +  G  O'4  +  Ufl  cos2 6  ~ 16 jq~r)]  (*w) 

where 

G=l+'7 _ l  \ 

1  -  r'1  \r*  +  1  -  cos*®  +  2 r2/  ^ 

For  the  high  gain  case,  it  is  possible  to  compare  lived -point 
and  floating-point  arithmetic  by  approximating  the  expres¬ 
sions  for  the  noise- to-signal  ratio.  For  the  first-order  case 
w„h  <i  =  l  —  5,  an(,  5«1,  ,he  result  (37b)  for  the  first-order 
niter,  can  lie  approximated  as 


1  1 

-  ~  -  2^26  - 
3  5 


Similarly  for  the  second-order  filter 


-2-2‘(3  +  4c"s2e\ 

3  \  45  sin2  e  ) 


where  in  (41)  and  (42)  we  have  taken  <t,2  =  (1  3)2-j4. 

For  fixed-point  arithmetic  we  recall  that  for  a  white-noise 
input  the  noise-to-signaf  ratio  behaved  as  1  52  and  for  a  sinus- 
oidal  input  as  1  5  Comparison  of  (41)  and  (42)  with  (14)  and 
16)  and  (18)  and  (20)  indicates  a  slightly  larger  noise-to-sig- 
nal  ratio  for  floating-point  arithmetic  as  compared  with 
xid-point  arithmetic  with  a  sinusoidal  input  of  known  fre¬ 
quency  but  a  significantly  smaller  noise-to-signal  ratio  for 
floating-point  arithmetic  as  compared  with  fixed-point  arith- 
nu  tic  with  a  white-noise  input.  It  is  important  to  keep  in 
niind.  however,  that  the  noise-to-signal  ratios  for  the  fixed- 
fibers  were  computed  on  the  basis  that  the  input  signal 
was  as  large  as  possible  If  the  input  signal  level  decreases,  the 
noise-to-signal  ratio  will  increase  since  the  output  noise  vari¬ 
ance  is  independent  of  the  input  signal  level.  For  floating-point 

tl,c  ,,tl,('r  ''and,  the  output  noise  variance  is 
proportumal  to  the  output  signal  variance  and  as  the  input 
is  scaled  up  or  down  so  is  the  roundoff  noise.  It  is  also 

tlTt  r,|a,Uf|,°  "°U’  t,m  t,K'  t<,n,l,aris,,n  just  discussed  assumes 
at  the  Hoating-po.nt  mantissa  is  equal  in  length  to  the 
entire  fixed  point  word,  and  does  not  account  for  the  extra 
hits  needed  for  the  characteristic.  The  authors  (6]  have  previ- 


"UM\  t,,1,,,,;|red  fixed-  and  floating-point  filters  on  the  basis  of 
equal  total  word  length.  However,  in  completing  such  a 
comparison  one  must  take  account  of  the  large  difference  in 
hardware  complexity  between  implementing  floating-point 

element tlC'  *  fcW  hits  *  fixed-point  arithmetic 

Oppeiiheim  |<>]  has  proposed  a  realization  of  recursive 
digital  filters  using  block  floating-point  arithmetic.  Here  the 
input  and  filter  states  (i.e.,  the  inputs  to  the  delay  registers) 
are  jointly  normalized  before  the  multiplications  and  addi 
turns  are  performed  in  fixed-point  arithmetic.  The  scale  factor 
(or  exponent)  obtained  during  the  normalization  is  then  ap¬ 
plied  to  the  final  output  to  obtain  a  fixed-point  output.  The 
roundoff  noise  properties  of  such  a  realization  were  studied 
and  the  no.se-to-s.gnal  ratio  was  found  to  lie  between  that 
lor  fixed  mid  floating  point. 

•'  l:;.  Zer°-h,Pul  LimH  Cycle  Behavior  of  Digital  Filters  for 

e  rixed-Patitt  Arithmetic 

-  In  the  preceding  discussion  the  effect  of  arithmetic  round¬ 
off  was  modelled  as  an  additive  white-noise  source,  u.icorre- 
ated  with  the  data.  Justification  of  this  model  assumes  that 
)  from  iteration  to  iteration,  the  input  can  be  expected  to  pass 
through  several  quantization  levels.  Consequently,  this  model 
is  applied  primarily  when  the  input  signal  has  a  complicated 
behavior  and  cannot  be  expected  to  be  valid  in  general  For 
example,  consider  a  first-order  filter  for  which  the  difference 
equation  is 

v„  =  ay*-!  -f  ,v„  (43) 

and  for  which  the  register  length  for  the  data  is  4  bits  and  the 
coefficient  or  is  0.5.  If  the  input  ,v„  is  7/8  and  if  rounding  is 
applied  after  the  arithmetic  then  on  successive  iterations  of 
the  filter,  the  output  will  be: 

>’o  =78 

y,  =  1/2 

Vs  =  1/4 
y3  =  1  8 

v*  =  1  8,  for  n  >  4. 

Thus  due  to  rounding,  the  output  reaches  a  steady-state  non¬ 
zero  value  and  since  the  ideal  steady-state  output  is  zero,  this 
nonzero  value  represents  roundoff  error.  Clearlv  this  kind  of 
roundoff  error  cannot  be  modelled  as  white  noise,  hut  in  fact 
re, .resents  a  limit  cycle  due  to  the  nonlinearity  corresponding 
to  the  quantizer  which  implements  the  rounding.  Limit  cvcle 
behavior  of  this  type  was  first  noted  by  Blackman  flO]  who 
referred  to  the  amplitude  intervals  within  which  these  limit 
cycles  are  confined  as  “deadbands.”  Blackman  considered  * 
only  first-order  limit  cycles  corresponding  to  a  dc  behav  ior  in 
the  deadband.  More  generally.  Jackson  [l  1  ]  has  considered 
limit  cycle  behavior  in  first-  and  second-order  filter  sections  * 
will,  an  analysis  based  on  the  location  of  the  “effective"  poles 
tlu;  h,tcr  1,1  roundoff.  Following  the  approach  pre¬ 
sented  by  Jackson,  consider  a  first-order  filter  with  a  differ 
ei.ee  equation  of  the  form  of  (43).  Due  to  the  register  length 
constraint,  the  product  av„_,  must  be  rounded.  Let  (•)' 
denote  the  operation  of  rounding.  If  the  register  length  is 
'*+!)  bits  and  if  data  are  represented  as  fractions  then 

(«y— ,)'  -  «y*-i  <  G)2~6.  (44) 

If  V„-1  is  such  that  (ay, =  |y„_,|  then  the  magnitude  of 
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tlic  effective  value  of  the  coefficient  is  unity  corresponding  to 
the  pole  of  the  filter  being  on  the  unit  circle.  The  range  of 
values  for  which  this  condition  is  met  is 

v„  i  —  <vy„  i  <  (.|,)2  b  (45) 

or 


This  range  of  values  is  referred  to  as  the  deadband.  Due  to 
rounding,  of  course,  values  within  the  deadband  must  be  in 
steps  of  2~k.  For  the  first-order  filter,  when  the  filter  state 
falls  within  this  range  and  the  input  is  zero,  the  effective  pole 
is  on  the  unit  circle  and  the  filter  will  support  a  limit  cycle 
behavior.  If  the  coefficient  a  is  positive,  as  in  the  above  ex¬ 
ample,  the  limit  cycle  response  is  dc,  i.e.,  has  constant  mag¬ 
nitude  and  sign.  For  a  negative  the  limit  cycle  behavior  litis 
constant  magnitude  but  alternating  sign. 

For  a  second-order  filter  there  is  a  larger  variety  of  modes 
of  limit  cycle  behavior.  In  particular,  consider  the  second- 
order  difference  equation 

y„  =  ,v„  -  fty„_t  —  fty„_.,  (47) 

With  ft2 <  -F4ft  the  filter  poles  i  •ccttr  as  a  complex  conjugate 
pair  and  with  ft=l  the  poles  occur  on  the  unit  circle.  The 
approach  proposed  by  Jackson  for  examining  the  limit  cycle 
behavior  of  the  second-order  filter  corresponds  to  considerin'; 
the  filter  behavior  when  the  effect  of  rounding  places  the  effec¬ 
tive  poles  of  the  filter  on  the  unit  circle.  With  zero  input  the 
effective  poles  will  be  on  the  unit  circle  if 

Vn-s  -  (ds V,,-.)  |  <  J  2  b  (48) 

or 


(0.5)2  » 


Thus  if  the  output  falls  within  this  range  the  effective  value  of 
ft-’  is  unity  so  that  the  effective  poles  are  oil  the  unit  circle. 
With  the  effective  value  of  ft..  as  unity,  the  effective  v  alue  of 
fti  controls  the  oscillation  frequency. 

A  second  mode  of  limit  cycle  behavior  occurs  in  second- 
order  filters  when  the  effect  of  rounding  is  to  place  an  effective 
pole  at  :=  ±1.  As  shown  by  Jackson,  the  deadband  corre¬ 
sponding  to  this  mode  is  for  values  less  than  or  equal  to 
1  (1—  /3i  4/3  )  in  steps  of  integer  multiples  of  2 

While  this  approach  is  somewhat  heuristic,  Jackson  has 
found  that  these  bounds  are  consistent  with  experimental 
results  and  hence  he  has  hypothesized  tln.t  tluv  represent 
necessary  and  sufficient  conditions.  These  bounds  for  second- 
order  filters  are  summarized  in  Fig.  7,  showing  different  dead¬ 
band  subregions  in  the  fti,  ft-;  plane.  The  number  within  an 
area  in  the  fti,  ft  plane  represents  the  maximum  magnitude  of 
the  limit  cycle  in  multiples  of  2  hand  the  cross  hatched  region 
represents  the  region  for  which  no  limit  cycles  can  occur. 

Recently,  Parker  and  Hess  [  1 2 ]  have  studied  the  limit 
cycle  problem  further,  and  found  that  these  hounds  are  ap¬ 
proximate!}'  correct  and  sufficient,  but  not  necessary.  In 
other  words,  there  exist  some  limit  cycles  outside  the  regions 
specified  by  Fig.  7. 

In  addition  to  the  above  classes  of  limit  cycles,  a  more 
severe  type  of  limit  cycle  can  occur  due  to  overflow  in  filters 
implemented  using  one’s-coniplement  or  tvvo's-complement 


y„_» |  < 


Fig.  7.  Deadband  subregions. 


arithmetic.  These  limit  cycles  have  been  referred  to  as  over- 
flow  oscillations  [  13 ]  and  can  be  avoided  by  using  saturation 
arithmetic. 

F.  Effects  of  Parameter  Quantization  in  Digital  Filters 

In  the  preceding  sections  we  focussed  on  the  effects  of 
arithmetic  roundoff  in  digital  filters.  Another  consequence  of 
the  requirement  of  finite  register  length  is  that  the  filter  coeffi¬ 
cients  cannot  be  specified  exactly.  Classical  design  procedures 
generally  lead  to  filter  coefficients  with  arbitrary  an  uracy  and 
the  implementation  of  the  filter  then  requires  that  the  coeffi¬ 
cients  be  modified  to  fit  the  available  register  length.  For  hard¬ 
ware  realizations  of  digital  filters  it  is,  of  course,  desirable  to 
keep  the  register  length  as  small  as  possible. 

One  common  approach  to  the  problem  of  parameter  quan¬ 
tization  is  tile  use  of  filter  configurations  or  structures  which 
in  some  sense  are  least  sensitive  to  inaccuracies  in  the  param¬ 
eters.  One  of  the  difficulties  in  evaluating  the  sensitivity  of 
filter  structures  is  the  choice  of  a  meaningful  measure  of  the 
sensitivity.  Most  commonly,  the  sensitiv  ity  of  the  filter  is  tied 
to  the  movement  of  the  poles  of  the  filter.  For  this  choice 
Kaiser  |14]  has  sl’iiwn  that  for  a  filter  with  clustered  poles  a 
cascade  or  parallel  combination  of  first-  and  second-order  sec¬ 
tions  provides  more  accuracy  in  the  pole  positions  than  a 
direct  form  realization.  This  is  basically  a  consequence  of  the 
fact  that  for  a  polynomial  whose  roots  are  clustered,  the  sensi¬ 
tivity  of  the  roots  to  changes  in  the  polynomial  coefficients 
increases  as  the  order  of  the  polynomial  increases.  Thus  the 
roots  can  be  more  accurate'}'  controlled  if  the  polynomial  is 
factored  into  first-  and  second-order  factors. 

I'-ven  within  the  choice  of  first-  and  second-order  sections 
some  flexibility  remains.  For  a  direct  form  implementation  of 
a  pole  pair  as  shown  in  Fig.  8(a)  the  coefficients  are  —r2  and 
2r  cos 6.  loir  a  given  quailtizuti  n  on  the  coefficients  the  poles 
must  lie  on  a  grid  in  the  z  plane  defined  by  the  intersection  of 
concentric  circles,  corresponding  to  quantization  of  r2  and 
vertical  lines,  corresponding  to  quantization  of  2r  cos  6.  Such 
a  grid  is  illustrated  in  Fig.  8(b).  An  alternative  realization  of 
a  pole  pair  is  the  coupled  form  proposed  In  Rader  and  Gold 
[15],  as  shown  in  Fig.  9(a).  In  this  case  the  coefficients  are 
r  cos  6  and  r  sin  0  and  consequently  the  poles  must  lie  on  a 
rectangular  grid  as  illustrated  in  Fig.  9(b).  We  note,  for  ex¬ 
ample.  that  for  a  given  coefficient  word  length  the  direct  form 
permits  more  accurate  placement  of  poles  with  r  close  to  unity 
nndfl  large  while  the  coupled  form  is  more  advantageous  for  6 
small.  There  are,  in  theory,  many  other  structures  in  addi¬ 
tion  to  the  direct  and  coupled  forms  for  implementing  pole 
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pairs  although  they  are  the  most  commonly  considered  Il6l. 
Different  structures,  of  course,  imply  different  grids  in  the  c 
plane  and  generally  it  is  advantageous  to  choose  a  structure 
or  " hit'll  the  grid  ts  dense  in  the  region  of  the  s  plane* where 
the  poles  are  to  he  located.  yug 

With  a  given  choice  of  structure  there  remains  the  ques¬ 
tion  as  to  how  the  pole  locations  on  the  grid  should  he  chosen. 
.  common  procedure  is  to  truncate  or  round  the  ideal  coeffi¬ 
cients  An  alternative  used  by  Avenhaus  and  Sclmssler  [17] 
and  also  by  Ste.ghtz  [18]  is  to  search  over  the  grid  in  the  vi- 
unit>  of  the  ideal  pole  locations  to  select  a  grid  point  which 
ocal  >  minimizes  the  maximum  error  in  the  filter  frequency 
response.  As  an  alternative  to  the  use  of  cascade  or  parallel 
connections  of  first-  and  second-order  sections,  more  general 
■  Iter  structures  can  lie  considered.  Digital  wave  filters,  as 
proposed  by  hettweis  19]  and  investigated  by  Bingham  [2()j 
and  by  Croch.ere  [21  ],  appear  to  have  much  less  sensitivity 
to  parameter  inaccuracies  than  the  cascade  form. 

It  would,  of  course,  be  desirable  to  incorporate  the  con¬ 
straint  of  quantized  coefficients  into  the  design  of  digital  fil¬ 
ers  For  nonrecursive  filters,  algorithmic  design  falls  within 
ic  framework  of  integer  linear  programming.  For  recursive 
hlters,  however,  the  equations  become  nonlinear.  In  general 
the  development  of  design  procedures  with  quantized  coeffi¬ 
cients  remains  an  important  area  of  research. 


I\.  Effects  of  Arithmetic  Roundoff  in  the  FFT 
1  Introduction 


e  EFT  algorithm  [22]  for  computing  the  discrete  Fou¬ 
rier  transform  DPT)  plays  a  central  role  in  many  signal  pro¬ 
cessing  applications  [23],  As  with  the  implementation  of 
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digital  biters,  it  is  important  to  understand  the  effect  of  finite 
register  length  arithmetic  on  the  performance  of  the  nlgo- 
rithm. 

I  here  are  main  forms  of  the  FFT  algorithm  and  the  de¬ 
tailed  effects  of  quantization  will  differ  depending  onfthc 
form  used.  I  he  most  cominoi  Iv  used  forms  of  the  algorithm 
are  the  radix-2  forms  for  whicl  the  size  of  the  transform  com¬ 
puted  is  an  integer  power  of  two.  For  the  most  part,  the  dis¬ 
cussion  below  is  phrased  in  ter  ns  of  a  particular  form  of  the 
radix-2  FFT,  commonly  referred  to  as  the  decimation  in  time 
form  of  the  algorithm;  the  results  however  are  applicable  with 
only  minor  modification  to  the  decimation  in  frequency  form. 
We  feel  that  most  of  the  ideas  employed  in  the  error  analysis 
of  the  radix-2  forms  of  the  algoritl  m  can  be  utilized  in  other 
forms  such  as  mixed  radix,  etc. 

Our  approach  in  analyzing  noise  in  the  FFT  is  basical' 
statistical.  In  most  cases,  the  predictions  of  the  models  are 
supported  with  experimental  data  (from  Weinstein  [24],  un¬ 
less  otherwise  stated).  For  floating  and  block  floating  point 
arithmetic,  in  order  to  simplify  the  analysis  and  obtain  con¬ 
crete  results,  it  is  convenient  to  assume  a  simple,  white-noise 
model  for  the  signal  being  transformed.  Discussion  of  how  the 
results  might  he  expected  to  change  for  other  types  of  signals 
is  included,  as  arc  experimental  noise  measurements  on  FFT’s 
of  nonwhite  signals. 


K.  The  FFT  Algorithm 

The  FFT  algorithm  is  directed  toward  computing  the 
DPT  of  a  finite  duration  sequence/(«),  defined  as 
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.v  1 

i'\k)  =  2/(»)ir-*,  ii  =<•-'>*' v)t  (5o) 

n-=0 

A  flow  chart  depicting  the  FFT  algorithm  for  A’  =  8  =  23  is 
shown  in  Hg.  10.  A  specific  decimation  in  time  algorithm  is 
depicted.  (An  implementation  of  this  particular  form  of  the 
algorithm  was  used  for  the  reported  experimental  work.) 
Some  key  aspects  of  this  diagram,  which  are  common  to  all 
standard  radix-2  algorithms,  are  as  follows.  The  I)KT  is  com¬ 
puted  in  v  —  log-*  A  stages.  At  each  stage,  the  algorithm  passes 
through  the  entire  array  of  A’  complex  numbers,  two  at  a 
time,  generating  a  new  A'  number  array.  The  eth  array  con¬ 
tains  the  desired  DPT.  I  lie  basic  numerical  computation 
operates  on  a  pair  of  numbers  in  the  (»i  +  l)th  array.  This 
computation,  referred  to  as  a  “butterfly”  is 

-v*s.(0  =  yji)  H-  IFA ■„,(]) 

•YmuO’)  =  -Ym(0  -  H  A -m(j).  (51) 

Here,  Am(t)  and  A  m(j)  represent  a  pair  of  numbers  in  the  with 
a  ray,  and  If  is  some  appropriate  integer  power  of  II',  that  is 

IP  =  1| 'i-  =  e-j?*v  v  (52) 

The  form  of  the  butterfly  computation  is  actually  somewhat 
different  for  a  decimation  in  frequency  algorithm,  where  the 
computation  is 

Amtl(0  =  A  nt(0  +  A ’m(j) 

-Vmn0')  =  [  Y„,(i)  -  .Ym0')]lP.  (53) 

At  each  stage,  A  2  separate  butterfly  computations  are  car¬ 
ried  out  to  produce  the  next  array.  The  integer  p  caries  with 
i,j,  and  ni  in  a  manner  which  depends  on  the  specific  form  of 
the  PPT  algorithm  that  is  used.  Fortunately,  our  analysis  is 
not  tied  to  the  specific  way  in  which  p  varies.  Also,  the  specific 
relationship  between  i,  j,  and  which  determines  how  we 
index  through  the  with  array,  is  not  important  for  the  anal¬ 
ysis.  I  he  details  of  the  analysis  for  decimation  in  time  and 
decimation  in  frequency  differ  somewhat  due  to  the  different 
butterfly  forms,  but  the  basic  results  for  the  dependence  of 
noise-to-signal  ratio  on  N  do  not  change  significantly.  In  our 
analysis  we  will  assume  a  butterfly  of  the  form  (51),  corre¬ 
sponding  to  decimation  in  time. 


C.  FFT  Roundoff  Noise  with  Fixed-Point  Arithmetic 

We  will  model  the  roundoff  noise  by  associating  an  inde¬ 
pendent  white-noise  generator  with  each  multiplier  This 
means  that  a  noise  source  feeds  into  each  node  of  the  signal 
flow  graph  of  Fig.  10  (excluding  the  initial  array  of  nodes, 
since  we  are  not  considering  A  I)  noise  here)  Since  we  are 
dealing  with  complex  multiplications,  these  elemental  noise 
sources  are  complex.  Defining  the  complex  variance  on1  as  the 
expected  squared  magnitude  of  such  a  noise  source,  we  have 

on 2  =  4  (54) 

12 

where  it  is  assumed  that  each  of  the  four  real  multiplications 
used  to  perforin  the  complex  multiplication  is  rounded  sepa¬ 
rately.  In  Fig.  10,  3X8  =  24  such  noise  sources  must  be  in¬ 
serted.  To  add  the  effects  of  each  of  the  noise  sources  in  eval¬ 
uating  the  total  roundoff  noise  in  the  output  we  note  that  the 
transmission  function  from  any  node  in  the  flow  graph  to  ant- 
other  connected  node  is  multiplication  ty  a  complex  constant 
of  unity  magnitude.  Since  we  assume  that  all  noise  sources  are 
uncorrelated,  the  noise  variance  at  any  output  node  is  equal  to 
oiP  times  the  number  of  noise  sources  that  propagate  to  that 
node.  The  general  result  which  is  easily  verified  for  the  case 
A'  =  8  by  ins|iection  of  Fig.  10  is  that  (A'— 1)  noise  sources 
propagate  to  each  output  node  so  that  the  output  noise  vari¬ 
ance  ok*  is  given  by 

ok'  =  (  Y  —  l )o/r 

which  for  large  A'  we  take  as 

<rK-  ~  A  air.  (55) 

According  to  this  result,  the  variance  of  the  output  noise  is 
proportional  to  A',  the  number  of  points  transformed.  The 
effect  of  doubling  A’,  or  adding  another  stage  in  the  FFT,  is  to 
double  the  output  noise  variance.  I'sing  the  assumptions  we 
have  made  thus  far  about  the  noise  generators  in  the  FFT  (all 
uncorrelated,  with  equal  variances),  the  output  noise  is  white, 
i.e.,  the  A'  noise  samples  E(k)  are  mutually  uucorrelated,  with 
independent  real  and  imaginan  parts.  This  follows  from  the 
fact  that  the  output  of  any  butterfly  is  white  (two  outputs  un¬ 
correlated  with  equal  variance,  real  and  imaginary  parts 
uncorrelated)  if  the  input  is  white.  Since  the  noise  sources  in 
our  system  are  white,  and  all  connected  to  the  output  via 
some  combination  of  butterfly  computations,  the  output  noise 
must  also  be  white. 

In  order  to  simpFfv  the  analysis  leading  to  (55),  we  have 
neglected  some  details.  First,  we  have  associated  equal  vari¬ 
ance  noise  sources  with  all  multipliers,  including  where  H'=  1 
and  j.  In  many  programmed  FFT's  these  multiplications  are 
performed  noiselessly.  If  we  assume  in  the  analysis  that  these 
multiplications  are  noiseless,  the  output  noise  variance  will  no 
longer  be  uniform  over  the  output  array.  For  example,  the 
zeroth  output  point  would  be  noiseless.  The  average  variance 
over  the  output  array  will  be  somewhat  lower  than  the  result 
in  (55),  but  will  retain  a  linear  dependence  on  A’.  Second,  the 
assumption  that  all  noise  sources  are  uncorrelated  is  contra¬ 
dicted  by  the  fact  that  the  two  noise  sources  associated  with  a 
given  butterfly  are  negatives  of  each  other,  and  therefore  com¬ 
pletely  correlated.  This  does  not  affect  the  result  for  output 
noise  variance,  since  the  two  outputs  of  a  butterfly  connect  to 
a  disjoint  set  of  output  points.  However,  it  implies  that  the 
output  noise  samples  E(k)  are  somewhat  correlated.  These 
details  are  worth  mentioning,  but  not  worth  analyzing  here  at 


leant h ,  because  they  cloud  the  essential  ideas  of  the  analysis, 
arc  quite  program-dependent,  and  do  not  change  the  essential 
character  of  the  dependence  of  mean-squared  output  noise 
on  X. 

In  implementing  the  F'F'T  with  fixed-point  arithmetic  we 
must  insure  against  overflow  From  (51)  it  follows  that 

max  [  ]  .Y ,„(/)  |  ,  XJj)  j 

<  max  [  -Y,,h(i)  ,  |  ]  (56) 

and  also  that 

max  [|  .Y„,tl(i)  j  ,  .Y j  ] 

<  2  max  [  Xm(i)  \  ,  j  X m(j)  \  j.  (57) 

Equation  (56)  implies  that  the  maximum  modulus  is  i.on- 
decreasing  from  stage  to  stage  so  that,  if  the  magnitude  of  the 
output  of  the  FFT  is  less  than  unity  then  the  magnitude  of 
the  points  in  each  array  must  be  less  than  unity,1  i.e.,  there 
will  be  no  overflow  in  any  of  the  arrays. 

In  order  to  express  this  constraint  as  a  bound  on  the  input 
sequence,  we  note  that  the  maximum  possible  output  can  be 
expressed  in  terms  of  the  maximum  input  as 


A’(  k)  | . .  <  ..•(«) 


=  -V  |  x(n) 
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2  _  2 

OF  =  OH'1 


where  an  2  represents  the  roundoff  noise  introduced  due  to 
multiplication  by  IT  and  scaling  and  will  consequently  be 
slightly  higher  than  an'1.  In  particular,  if  we  assume  that  the 
scaling  is  accomplished  with  rounding,  it  ran  be  shown 


an  2  =  -  2  2ft. 
6 


Thus  bounding  the  input  sequence  so  that 

-r(«) !  <  l/.Y  (59) 

will  prevent  overflow.  To  obtain  an  explicit  expression 
for  output  signal  variance,  we  assume  x(n)  white,  with 
real  and  imaginary  parts  each  uniform!)  distributed  in 
(—1  y  2A',  1  v'2-V).  Then  we  have 

tvs  =  \X(W-  =  XaS  =  Y  |  .r(«)  3  =  ^  ■  •  (60) 

Combining  this  with  (55)  yields 
2 

=  5-Y  "an-.  (61) 

ax' 

The  assumption  of  white  input  signal  is  not  critical  here.  For 
example,  if  a  complex  sinusoid  ,v(»)  =  (1  .V)  exp  j(2irk\n  X  +<t>) 
bad  been  selected  a f;2  a\-  would  still  be  proportional  to  A’2, 
which  is  the  essential  point  of  (61). 

Equation  (57)  suggests  an  alternative  procedure  for  pre¬ 
venting  overflow.  Since  the  maximum  modulus  increases  by 
no  more  than  a  factor  of  two  from  stage  to  stage  we  ran  pre¬ 
vent  overflow  by  requiring  that  ,v( n)  <1  and  incorporating 
an  attenuation  factor  of  1  2  at  each  stage.  Using  this  step-by- 
step  scaling,  the  attainable  output  signal  level  (for  white 
input)  is  the  same  as  in  (60)  since  the  output  signal  level  does 
not  depend  on  where  the  scaling  is  done,  but  only  on  how 
much  overall  scaling  is  done.  However,  the  output  noise  level 
will  be  much  less  than  in  (55)  since  the  noise  introduced  at 
early  stages  of  the  FFT  will  be  attenuated  by  the  settling 
which  tabes  place  at  the  later  array.  Quantitatively  for  X  =  2’ 


1  Actually  nm-  should  discuss  overflow  in  teims  of  the  real  and  imagi¬ 
nary  parts  of  the  data,  rather  than  the  magnitude.  However,  x  <1 
Implies  that  Ke  (*)  <t  and  Im  (x)  <t,  and  only  a  slight  increase  in 
allowable  signal  level  is  achieved  by  scaling  on  the  basis  of  Re  and  1m 
parts. 


For  large  X,  (62)  is  approximately 


and  thus  is  much  less  than  the  noise  variance  resulting  when 
all  of  the  scaling  is  carried  out  on  the  input  data. 

Now,  we  can  combine  (64)  with  (60)  to  obtain  the  output 
noisc-tn-signal  ratio  for  the  case  of  step-by-step  scaling  and 
white  input.  \Ye  obtain 

ok1 

- =  6.\t r«<*  =  (5.Y)2  26  (65) 

<r.v! 

a  result  proportional  to  X,  rather  than  to  AT  An  interpreta¬ 
tion  of  (65)  is  that  the  mis  output  noisc-tn-signal  ratio  in¬ 
creases  as  X,  or  by  half  a  bit  per  stage.  This  result  was  first 
obtained  by  Welch  [25 ].  It  is  important  to  note  that  the  as¬ 
sumption  of  white  signal  is  not  essential  in  the  analysis.  The 
basic  result  of  half-a-bit-per-stage  increase  holds  for  a  broad 
class  of  signals,  with  only  the  constant  multiplier  in  (651 
being  signal-dependent.  In  particular,  for  a  general  input  with 
scaling  at  each  array,  the  output  variance  is  related  to  the 
variance  of  the  input  array  by 


ax'  -  a, 

s 


•v(»)  5 


so  that 


-  ,Y2  n 
an'  5 


where,  to  reduce  noisc-tn-signal  ratio,  we  would  like  to  make 
a,2  as  large  as  possible  but  are  limited  bv  the  constraint 
|.v(»)|  <1.  The  result  (67)  lias  been  verified  experimentally 
for  both  wide-hand  and  narrow  band  signals  1 24 ] ,  1 2 5 ] . 

\Ye  should  also  note  that  the  dominant  factor  causing  the 
increase  of  <r/;2  a.y2  with  X  is  the  decrease  in  signal  level 
(required  by  the  overflow  constraint)  as  we  pass  from  stage  to 
stage.  According  to  (64)  and  (64),  very  little  noise  (only  a  bit 
or  two)  is  present  in  the  final  array  Most  of  the  noise  has 
been  shifted  off  by  the  sealings.  However,  the  mean-squared 
signal  level  has  decreased  by  a  factor  of  1  A  from  its  initial 
value,  due  to  the  sealings.  Our  output  consists  not  of  the  DFT 
defined  by  (50)  but  of  l/.Y  times  this  DFT. 

We  have  assumed  straight  fixed  point  computation  in  this 
section,  i.e.,  only  preset  attenuations  were  allowed,  and  we 
were  not  permitted  to  rescale  on  the  basis  of  an  overflow  test 
Clearly,  if  the  hardware  or  programming  facility  arc  such  that 
straight  fixed  point  must  be  used,  we  should,  if  possible,  incor¬ 
porate  attenuators  of  1/2  at  each  array  r„tber  than  using  a 
large  r.ttenuation  of  the  input  array. 
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Fig.  11.  Kxperimcm.il  and  theoretical  noi w-lo-signat 
rnlios  lor  block  flouling-poinl  FFT. 


A  third  approach  to  avoiding  overflow  is  the  use  of  block 
floating  point.  In  this  procedure  the  original  array  is  normal¬ 
ized  to  the  far  left  of  the  computer  word,  with  the  restriction 
that  ,v(»)j  <1:  the  computation  proceeds  in  a  fixed  point 
manner,  except  that  after  every  addition  there  is  an  overflow 
test;  if  overflow  is  detected,  the  entire  array  is  shifted  right  t 
hit  and  the  computation  continues.  The  number  of  necessary 
shifts  are  counted  to  determine  a  scale  factor  or  exponent  for 
the  entire  final  array.  The  output  noise-to-signal  ratio  depends 
strongly  on  how  many  overflows  occur,  and  at  what  stages  of 
the  FFT  they  occur.  The  positions  and  timing  of  overflows 
are  determined  by  the  signal  being  transformed,  and  thus,  in 
order  to  analyze  noise-to-signal  ratio  in  block  floating  FFT, 
one  needs  to  know  the  signal  statistics.  This  is  in  contrast  to 
the  fixed  point  analysis  above,  where  it  was  not  necessary  to 
assume  specific  signal  statistics. 

The  necessary  number  of  right  shifts  of  the  array  is  related 
to  the  peakiness  of  the  DFT  of  the  signal  being  transformed. 
If  the  constant  signal  x(n)  =  1  or  the  single-frequency  input 
x(n)  —  exp  j(2r/N)k0n  is  transformed,  the  output  (with  k0  an 
integer)  will  consist  of  a  single  nonzero  point  and  (for  N  =  2")  v 
scalings  of  the  array  will  be  necessary,  one  for  each  stage. 

A  reasonable  case  to  examine  is  the  case  of  a  white  input 
signal;  the  DFT  of  a  white  signal  is  white,  and  one  might 
expect  (since  the  spectral  energy  is  spread)  that  scalings  at  all 
stages  would  not  be  necessary,  and  a  noise-to-signal  ratio 
advantage  over  fixed  point  would  be  gained.  This  problem 
can  be  analyzed  theoretically  [24]  but  the  analysis  is  quite 
involved  and  will  be  omitted.  Instead,  we  will  present  some 
experimental  results. 

In  Fig.  It  experimentally  measured  values  of  output 
noise-to-signal  ratio  are  presented  for  block  floating  FFT'sof 
white  inputs,  using  rounded  arithmetic.  The  quantity  plotted 
is  (an1  2  2hUi‘iy1'1,  the  rms  noise-to-signal  ratio.  For  com¬ 
parison.  a  theoretical  curve  representing  fixed  point  noise-to- 
signal  ratio  (for  rounded  arithmetic)  is  also  shown.  We  see 
that  for  white  input  block  floating  point  provides  some  ad¬ 
vantages  over  fixed  point,  especially  for  the  larger  transforms. 
For  N=  2048,  the  rms  noise-to-signal  ratio  for  block  floating 
point  is  about  1  8  that  of  fixed  point,  representing  a  3- bit 
improvement. 

An  experimental  investigation  was  used  to  examine  how 


Fig.  12.  Noisy  butterfly  compulation  (floating  poinl). 


the  results  for  block  floating  point  change,  when  truncation 
rather  than  rounding  is  used.  The  results  of  this  experiment 
are  also  shown  in  Fig.  11.  Noise-to-signal  ratios  are  generally 
a  hit  or  two  worse  than  for  rounding.  The  rate  of  increase  of 
noise-to-signal  ratio  with  A’  seems  to  be  about  the  same  as  for 
rounding. 

I).  FFT  Roundoff  .Xoise  with  Floating-Point  Arithmetic 

The  effect  of  arithmetic  roundoff  with  floating-point  arith¬ 
metic  has  been  analyzed  theoretically  and  experimentally  by 
Gentleman  and  Sande  [26],  by  Weinstein  [27],  and  by 
Kaneko  and  Liu  |28],  As  with  the  statistical  analysis  of 
roundoff  errors  with  fixed-point  arithmetic,  noise  is  intro¬ 
duced  due  to  each  butterfly  computation.  As  with  floating¬ 
point  errors  in  digital  filters,  we  neglect  second-order  error 
terms  so  that  noise  sources  are  introduced  after  each  multi¬ 
plication  and  addition  that  are  assumed  to  he  white  but  for 
which  the  variance  is  proportional  to  the  variance  of  the  sig¬ 
nal  at  that  node.  Unless  the  input  signal  is  assumed  to  be 
white,  the  analysis  becomes  quite  complicated  due  to  the 
variation  of  the  variance  of  the  signal  and  therefore  of  the 
noise  sources  within  each  array.  Kaneko  and  Liu  have  ob¬ 
tained  detailed  formulas  for  a  general  stochastic  model  of  the 
input  signal.  We  will  confine  attention  here  to  the  case  of 
white  input  signal,  where  the  signal  at  any  array  in  the  FFT 
is  also  white,  with  constant  variance  across  the  array. 

In  Fig.  12  a  typical  butterfly  computation  (only  top  half) 
is  indicated,  including  the  noise  sources  due  to  multiplication 
and  addition.  The  assumption  of  white  input  signal  implies 
that 


I  Re  (.Vm)]-  =  [Im  (.Vm)]2  =  |  |  .Ym|2  (68) 

and  application  of  our  floating-point  noise  model  as  in  Section 
1II-D  yields  the  noise  source  variances 

crr,2  +  =  <r,52  +  =  <r„2  =  o',,2  =  5  nr,A«  *  (69) 

a,., 2  =  a,,2  =  <rf2|.Y„|2.  (70) 
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The  \  ariance  of  the  complex  noise  source  Um  =  /<m  +  jvM  is  then 


|fM;s  =  4ff.»|.Y„|*  (71) 

so  that  the  variance  of  tile  noise  generated  in  computing  the 
(w-f-l)th  arr.i\  is  4<x,2  times  t1  e  variance  of  tile  signal  in  the 
mth  array  If  the  input  (zeroth)  array  is  white  noise  with 
variance  a*  then  the  noise  generated  in  the  (w  +  l)th  array  is 
2'”<x,2(4<x,2)  If  tj„m‘  is  the  output  noise  clue  to  the  noise  gen¬ 
erated  in  the  (m  +  l)th  array,  then 

Com-  =  2'-<”+'>2"c7,2(4<7,:!)  =  2AV.V,2.  (72) 

Since  the  noise  generated  in  each  array  is  assumed  to  he  inde¬ 
pendent,  the  total  output  noise  variance  a*1*  is 

otr  =  2v  AV.V/*.  (72) 

By  noting  that  the  output  signal  variance  is  related  to  the 
input  signal  variance  hy 


o.\"  =  \<Jj- 


Thtorttical  ftimplifltd  analytk) 
Thtortticol  (modified  analysl*) 
Eipdrimtnlal  (randamiiad  rounding) 
Experimental  (nanrandamlted  rounding) j 
Filled  Quadra  lie  Curve  P 


the  result  follows: 


.» — i — i _ 

•  r 


;  =  la<-v-  (75) 

os" 

A  further  result,  which  can  he  derived  from  our  model,  is 
an  expression  for  the  final  expected  output  noise-to-signal 
ratio  which  results  after  performing  an  FFT  and  an  inverse 
FFT  on  a  white  signal  ,v(»).  The  inverse  ITT  introduces  just 
as  much  roundolT  noise  as  the  FFT  itself,  and  thus  the  result¬ 
ing  output  noise-to-signal  ratio  is 

2 

~  =  W*  (76) 

or 

or  just  double  the  result  in  (75). 

In  order  to  see  the  implications  of  (75)  or  (76)  in  terms  of 
register  length  requirements,  it  is  useful  to  express  these 
results  ill  units  of  hits.  We  use 

C ok 2  os-o,2)  bits  =  £  log,  (2e)  (77) 

to  represent  the  number  of  hits  hy  which  the  rms  noise-to- 
signal  ratio  increases  in  passing  through  a  floating  point  FFT. 
For  example,  for  e  =  8  this  represents  2  hits  and  for  v  =  1 1  it 
represents  2.25  hits.  The  mimhei  of  hits  of  rills  noise-to-signal 
ratio  increases  as  log,  (log.  A'),  so  that  doubling  the  number 
of  points  in  the  I  FT  produces  t  very  mild  increase  in  output 
noise,  significantly  less  than  the  half-bit-per-stage  increase 
for  fixed-point  computation,  in  fact,  to  obtain  a  half-hit  in¬ 
crease  in  the  result  above,  we  would  have  to  double  v,  or 
square  X. 

In  the  analysis  leading  to  (75),  we  have  not  considered  the 
fact  that  multiplications  hy  1  can  he  performed  noiselessly, 
for  a  specified  radix-2  algorithm,  such  as  the  decimation  in 
time  algorithm  shown  in  fig.  10,  these  reduced  variances  for 
IF  =  1  and  j  can  he  included  in  the  model  to  obtain  a  slightly 
reduced  prediction  for  output  noise-to-signal  ratio.  However, 
for  reasonably  large  \,  this  modified  noise  analysis  yields  only 
slightly  better  predictions  of  output  noise  than  does  the 
simplified  analysis  above. 

A  consequence  of  our  analysis  leading  to  (75)  is  that  the 
output  noise  is  white.  This  follows  from  the  fact  that  each 


- Theorelicol  (simplified  onolysn)  . 

-  Theorelicol  (modified  onolysis)  / 

0  Experimental  (rondomijed  rounding)  j 

o  Experimental  (nonrondomned  rounding)  J 

- Filled  Quodrolic  Curve  j 

J 


Fig.  t-l.  (a)  ICxicerimcMilal  ami  tlicorclir.il  noisc-lo-signal  ratios  for  Hoot¬ 
ing  Poinl  FFT.  to)  K.xpo internal  and  lheorctical  noisc-lo- signal  ratios 
for  flonling-poit  i  FKT  ami  inverse. 


array  of  noise  sources  is  white.  I  lie  reduced  noise  source  vari¬ 
ance  for  If'  =  l  and./’  implies  that  for  some  arrays  there  will  he 
a  variation  of  noise  source  variance  over  the  array.  This  im¬ 
plies  a  slight  variation  of  output  noise  variance  over  the  out¬ 
put  array,  and  thus  our  modified  noise  analysis  will  only  pre¬ 
dict  an  average  noise  variance  over  the  output  array. 

The  results  discussed  above  hav  e  been  verified  with  excel¬ 
lent  agreement  as  shown  in  Fig.  15(a)  and  (b).  To  obtain  this 
agreement,  however,  it  was  necessary  to  use  randomized 
rounding,  i.e.,  randomly  rounding  up  or  down  when  the  value 
of  mantissa  was  exactly  (1  2)2~h.  The  modified  theoretical 
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Hr.  14.  Experimental  noise-to-signal  ratios  for  floating  point  KKT  and 
H’  1  -inverse  KP  T ;  truncation  used  instead  of  rounding. 


curve  shown  was  obtained  by  taking  into  account  reduced 
noise  source  variances  for  tf'  =  t  and  If'  =  j.  Also  shown  are 
experimental  results  for  nonrandotnized  rounding.  These 
results  were  fitted  empirically  with  a  curve  of  the  form  nv\ 
hut  this  quadratic  dependence  was  not  established  theo¬ 
retically.  Xoise-to-signal  ratios  were  also  measured  for  the 
case  where  truncation  rather  than  rounding  was  used  ir  the 
arithmetic;  the  results,  with  empirically  fitted  quadratic 
curves,  are  shown  in  Fig.  14. 

Our  analysis,  and  all  the  above  experiments,  applied  to  the 
case  of  white  signal.  Some  experimental  investigation  has 
been  carried  out  as  to  whether  the  predictions  are  valid  when 
the  signal  is  nonwhite.  Specifically,  the  noise  introduced  in 
computing  an  FbT  was  measured  for  sinusoidal  signals  of 
several  frequencies,  fori-  =  8,  9,  10,  and  11.  The  results,  aver¬ 
aged  over  the  input  frequencies  used,  were  within  15  percent 
of  those  predicted  by  (75).  In  these  experiments,  the  “ran¬ 
domized”  rounding  procedure  was  used. 

E.  Effects  of  Coefficient  Quantization  in  the  FFT 

As  with  the  implementation  of  digital  filters,  the  imple¬ 
mentation  of  the  FFT  algorithm  requires  the  use  of  quantized 
coefficients.  While  a  completely  definitive  study  of  the  effects 
of  coefficient  quantization  in  the  FFT  remains  to  he  done,  two 
approaches  have  been  pursued  for  which  some  results  have 
been  obtained. 

Although  the  nature  of  coefficient  quantization  is  inher¬ 
ently  nonstatistical,  Weinstein  [24]  has  obtained  some  useful 
results  by  means  of  a  rough  statistical  analysis.  This  sta¬ 
tistical  analysis  corresponds  to  introducing  random  jitter  in 
the  coefficients  and  determining  the  output  noise- to-signal 
ratio  due  to  this  noise.  While  the  detailed  effect  due  to  coeffi¬ 
cient  error  due  to  quantization  is  different  than  that  due  to  jit¬ 
ter,  it  is  reasonable  to  expect  that  in  a  gross  sense  the  magni¬ 
tude  of  the  errors  is  comparable. 

To  develop  this  statistical  analysis,  we  let  F(k)  denote  the 
DFT  of  a  sequence /(»)  and  F(k)  the  result  of  transforming 
f  in  with  a  radix-2  FFT  algorithm  with  jittered  coefficients. 


Then 


v-i 

F(*)  =  Z /<«)»'■*  (78) 

M-0 

and 


l-'(k)  =  Z/(»)!l,t.  (79) 

Because  of  the  form  of  the  FFT  algorithm  each  element  il„k 
will  be  a  product  of  r  =  log2  X  quantized  coefficients.  Thus 


where 


=  II  (H’«-  +  «,) 


(80) 


II  IF°<  =  H'»*  (81) 

1-1 

with  b  bits  for  the  real  and  imaginary  parts  of  each  of  the 
coefficients,  excluding  sign,  «,|  is  less  than  or  equal  to 
(\  2)2  .  If  we  assume  that  the  real  and  imaginary  parts  of 
the  jitter  in  the  coefficients  are  uncorrelated  and  uniformly 
distributed  between  plus  and  minus  (1/2)2'*  then  at\  the 
variance  of  5,  is  =  2-26  6.  The  error  in  the  computation  of 
the  DFT  can  be  expressed  as 

/((*)  =  F(k)  -  F(k)  =  £/(„)( -  IF"*).  (82) 

n—  0 

l-rom  (80)  and  (81)  we  can  express  the  factor  (I),,*-  IP*)  as 
(lint  -  II  "*)  =  z  n  IF0'  -f-  higher  order  terms.  (83) 

i-i  j- i 

If  we  neglect  higher  order  error  terms,  and  assume  that  j,- 
are  mutually  uncorrelated  then  the  variance  of  (U„k-  Ip*)  is 
equal  to  iff  2-*  6).  Finally,  assuming  that  all  elements  U„k  are 
uncorrelated  with  each  other  and  with  the  input  signal,  the 
output  error  variance  <te'‘  is 

2—26  .v-i 

**'  =  9  ~7~  21  I  /(")  Is-  (84) 

Since  from  1’arseval’s  relation 

£  i/<«>  i*  =  V£  i F{k)  i2 

n-0  A  „_0 

=  mean-squared  output  signal  (85) 

the  ratio  of  mean-squared  output  error  to  mean-squared  out¬ 
put  signal  is  thus 

-/[(^(l  I «*)!•)]-©« 

Although  we  would  not  expect  (86)  to  predict  with  great 
accuracy  the  error  in  an  h  1  T  due  to  coefficient  quantization, 
it  is  helpful  as  a  rough  estimate  of  the  error.  The  key  result  of 
(86),  which  we  would  like  to  test  experimentally,  is  that  the 
error- to-signal  ratio  increases  very  mildly  with  X,  being  pro¬ 
portional  to  v  =  logj  N,  so  that  doubling  N  produces  only  a 
slight  increase  in  the  error-to-signal  ratio. 
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l<>  tost  this  result,  experimental  measurements  on  errors 
due  to  eoellicient  quail ti/atinn  were  made.  In  each  run,  a 
sequence  /(.,)  white  in  the  sense  that  all  2.V  real  numbers 
making  up  the  .V-p  .int  complex  sequence  were  mutually  un- 
corre lated,  with  zero  means,  and  etpial  variances  was  ob¬ 
tained  using  a  random  number  generator.  This  sequence  was 
transformed  twice,  once  using  a  .16-1, it  coefficient  table,  and 
once  using  a  coefficient  table  rounded  to  much  shorter  word 
ength  (e.g„  12  bits).  For  each  transform,  .16- bit  accuracy  was 
used  ...  the  arithmetic  to  make  the  effect  of  roundoff  error 
negligible.  I  he  results  were  subtracted,  squared,  averaged 
<’ver  the  output  array,  and  divided  by  the  output  signal 
variance  ( A  times  the  input  signal  variance)  to  obtain  an 
experimental  output  error- to-signal  ratio.  For  each  value  of 

several  random  sequences  were  transformed  and  the  re¬ 
sults  were  averaged  to  obtain  statistically  convergent  csti- 
mat*'?. 

1  H'ft' rtMI,tS  ,m‘  ‘,‘Kl’!a>'l'<l  h'ig.  15:  the  quantity  plotted 
is  on  -  -  w  here  or2  is  the  mean-sipiared  output  signal  as 

do  lined  in  (8.x).  I  lie  theoretical  curve  corresponding  to  (86)  is 
shown,  and  the  circles  represent  measured  output  error- 1 
Hgnal  ra,l“  f,,r  t,,‘‘  fixed-point  case.  W  e  note  that  the  experi¬ 
mental  results  generally  lie  below  the  theoretical  curve  No 
experimental  result  differs  by  as  much  as  a  factor  of  two  from 
the  theoretical  result,  and  since  a  factor  of  two  in  oK*  < Tr*  cor¬ 
responds  to  only  half-a-l.it  difference  in  the  mis  output  error, 
it  seems  that  (86)  is  a  reasonably  accurate  estimate  of  the 
effect  of  coefficient  errors.  The  experimental  results  do  seem 
to  increase  essentially  linearly  with  c,  but  with  smaller  slope 
than  given  in  (86). 

In  the  above,  fixed-point  arithmetic  has  been  assumed, 
owover,  since  a  block  floating-point  FIT  will  generally  use 
hxed  point  coefficients,  our  results  are  valid  for  the  block 
oating  point  tase  also.  With  some  slight  modifications,  it  is 

IHiwiblt-  . . tain  similar  results  for  the  floating-point  case 

txeept  for  a  constant  factor,  the  floating-  and  fixed-point 
results  are  the  same.  Experimental  results  for  the  floating¬ 
point  case  are  represented  by  the  solid  dots  in  F'ig.  15,  and 


i  ur.  ir. r.r. ,  AlTil.ST 


are  observed  to  be  slightly  lower  than  the  results  for  the  fixed 
point  case. 

A  different  approach  to  the  characterization  of  |  |  J  coefti- 
ticnt  quantize  ion  has  been  taken  In  Tufts,  Mersey  and 

; j  "7r  |2<;  1  1  n  ,h;'ir  sis  1  ’w  ‘-ffwt  of  «K- fficie.it  quantiza 
•  -  represented  in  terms  of  the  level  of  spurious  sidDobes 

m  n  duml.  In  particular,  a  sequence  /<»)  =  «„(*- where 

;  a  unit  sample,  has  a  DIT  with  a  purely  sinus¬ 

oidal  real  and  imaginary  part,  i.e., 


/'(/•)  =  IF-* 


(87) 


and  the  inverse  DFT  of  /■'(*)  should,  of  course,  have  only  a 

single  nonzero  component.  Due  to  co-fficient  quantization 
however,  the  DFT  obtained  is  ttzat.on. 


m  =  <1- 


(88) 


and  since  the  real  and  imaginary  parts  are  not  exactly  sinus- 
"itl-  exact  coefficients,  of  f(k)  will 
have  spurious  components.  For  each  of  the  set  of  .V  sequences 

/*(")  =  _  M,)(  /■  =  0, 1, . .  . ,  A  _  1  (89) 

Tufts  e!  „/  compute  the  DFT  with  quantized  coefficients  fol- 
,0'  l‘d  "'V,TSe  l),'T  "ith  accurate  coefficients.  A,  the 

t;:;" ,,f  ":c  mv™  ,)kt-  ">*■  a„,i  frequency  iWa,i„„. 

I  tnt  spurious  sidelobe  components  produced  due  to  the 
quantized  coefficients  are  observed.  Since  any  function  /(„) 
can  be  constructed  as  a  weighted  sum 


v-i 


/(«)  =  Z  </*/*(«) 


(90) 


spurious  sulelobes  produced  for  any  /(,,)  can  in  principle 
he  determined  by  combining  the  responses  due  to  a  set  of  im¬ 
pulses  B„t  carrying  o„,  such  a  combination  is  not  practical 
for  arbitrary  fin).  Tufts  c,  have,  however,  tabulated  the 

"f  Hi 0  "  ,  “  n  ‘‘,S  ‘''’countered  for  any  /*(,,)  as  a  function 

the  number  of  bits  retained  in  the  coefficients,  for  the  case  of 

cimits’  ’""  an,i  s,s,t-"ingnitu.le  representation  of  coetfi- 

V  I  XAMI’l.KS 

■  I.  I iitrnilnr/ioii 

in  Hie  preceding  sections  the  effects  of  arithmetic  roundoff 
•axe  been  analyzed  for  simple  (first-  and  second  -order)  digital 
Urx  and  the  I- 1-  I  1  l.ese  algorithms  are  the  basic  building 
blocks  .a  more  complicated  digital  processing  such  as  a  higher 
order  digital  filter  or  a  convolutional  filter  realized  vis,  ,|,e 
H  I.  Examples  will  be  presented  in  this  section  to  indicate 
Sm,;e  "f  t  w  i‘l™«  ‘h" doped  above  can  be  applied  to  ana- 
y/\  and  to  choose  the  most  advantageous  configuration  for 
siuh  systems.  I  he  first  two  examples  concern  the  renlLation 
»f  higher  order  recursive  filters  and  have  borrowed  from  the 
work  of  other  authors.  The  third  example  Deals  with  an  I  F  T 

H.  Fixe, I -Point  l)iKi,„l  filter  /„  Cascade  and  Parallel  Form 

After  a  digital  filter  has  been  specified  in  terms  of  its  poles 
and  zeroes,  and  the  type  of  arithmetic  has  been  selected  a 
dunce  must  still  be  made  among  the  various  possible  con¬ 
figurations  of  the  filter  which  will  differ  with  respect  to  the 
effects  of  roundoff  noise.  An  exhaustive  stu.lv  of  the  selection 
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log  l(t  Sccimil-imler  section  will)  poll's  prcivtling  zeros.  I  si  ll  in  Jack 
son's  IP  parallel  inrra  with  <n-0.  Also  used  in  1 1)  direct  iiirni’ 


Fig.  1"  Second-order  section  will ■  zeros  preceding  jitiles.  t'seil  in  2P 
parallel  form  with  u>  =  ».  Also  used  in  21)  direct  form. 


of  III  ter  form  is  beyond  flic  scope  of  this  paper,  but  an  excellent 
example  of  the  necessary  considerations  is  given  by  Jackson 
| do]  1.1  his  analysis  of  roundoff  noise  for  fixed  point  digital 
filters  realized  in  cascade  ami  parallel  form 

Jackson  considers  two  parallel  form  realizations:  the  1 P 
form  where  the  individual  second-order  sections  are  realized, 
as  shown  in  lig.  16,  with  the  poles  preceding  the  zeros,  and 
the  2V  form  where  zeros  precede  poles  in  the  individual  sec¬ 
tions,  as  shown  in  I'ig.  17  (Figs.  16  and  17  do  not  show  ill  the 
scaling  coefficients  needed  to  prevent  overflow.)  His  analysis 
indicates  that  for  a  v  ariety  of  scaling  criteria  (based  on  tlif 
ferent  1.,,  norms2  of  the  input  signal)  ami  for  v  arious  measures 
of  the  output  noise  (such  as  its  total  power,  or  its  peak  spec¬ 
tral  value),  the  output  signal-to-noise  ratios  of  the  two  forms 
are  very  close.  Generally,  a  very  slight  advantage  will  be 
gained  with  the  1 !’  form. 

Comparison  of  the  two  parallel  forms  basically'  reduces  to 
a  comparison  of  the  noise  properties  of  the.  two  forms  of  sec¬ 
ond-order  sections,  since  the  noises  from  the  second-order  sec¬ 
tions  are  simply  additive  in  the  output  of  the  parallel  form. 
Hence  the  discussion  abov  e  applies  to  a  comparison  of  second- 
order  section  realizations.  Another  form  of  second-order  sre- 
tion  which  could  be  considered  is  a  coupled  form  as  shown  in 
f  ig  18.  For  the  case  a,  =  r  cos  0,  n.2  =  r  sot  0,  this  filter  has 
poles  at  z  =  re*)'  and  a  zero  at  s  =  r  cos  0.  The  coupled  form 
noise-to-signal  ratio  has  been  compared  [24]  to  ratios  for 
forms  essentially  the  same  as  those  in  Figs.  16  and  17  for  the 


*  If 

f  u)  -  £  ftm  exp  (  -jin  —  »  ) 

represents  the  Fourier  trails  orm  of  a  signal  or  of  a  filter  impulse  rosimnsc 
/(«),  then  tlie  correspond  g  l.p  norm  is 

!«„-(-  rw*.y,‘ 

\sjf  J  0  / 

where  w,  denotes  sampling  frequency. 


•i 


Fig.  IS.  Coupled  form  for  second-order  section. 

case  of  white  noise  input  and  an  absolute  overflow  constraint 
(through  the  type  of  analysis  given  in  Section  lll-C).  The 
coupled  form  was  found  to  have  substantially  lower  noise-to- 
signal  ratio  for  filters  with  high  gain  and  low  resonant  fre¬ 
quencies.  For  5=  1  —  r,  and  5«1,  the  results  vary  as 

OP 

1  6-  sin2  0  (forms  of  Figs.  16  and  17) 

<V 

a,.' 

1  o'  (coupled  form).  (91) 

<V 

1  he  implication  of  this  result,  together  with  the  somewhat 
reduced  coeflieicnt  sensitivity  for  the  coupled  form,  is  that 
this  form  in  iv  be  a  good  choice  in  some  situations,  despite  the 
fact  that  its  implementation  requires  four  multiplications 
instead  of  three  for  a  pole  pair  and  a  single  zero. 

As  stated  above,  Jackson  found  not  much  difference  be¬ 
tween  the  noise  properties  of  the  17'  and  21’  parallel  forms. 
However,  the  situation  is  mori  interesting  in  the  ease  of  the 
cascade  form.  Here  he  linds  that  large  differences  are  possible 
between  the  roundoff  noise  outputs  of  the  1/7  (poles  before 
zeros  in  individual  sections)  and  2D  (zero?  before  poles  in 
each  section)  forms.  Also  the  ordering  of  the  sections  and  the 
pairing  of  poles  and  zeros  have  important  effect  on  the  output 
signal-to-noise  ratio.  Jackson's  analyses  lead  to  several  rules 
of  thumb  for  selection  of  \D  or  2D,  for  ordering  of  sections, 
and  for  pairing  of  poles  and  zeros. 

In  general,  the  choice  of  eonfiguration  depends  on  which 
l.p  norm  of  the  sealed  transfer  functions  is  constrained  to  pre¬ 
vent  overflow  and  on  which  l.r  norm  of  the  output  noise  spec¬ 
trum  is  used  as  a  measure  of  performance.  Two  Lv  constraints 
on  the  Idler  are  of  particular  interest:  the  ,*>=  »  case,  where 
the  peak  value  of  the  transfer  function  to  each  possible  over¬ 
flow  node  is  constrained;  and  l lie  p  =  2  ease  where  the  rill? 
transfer  function  to  each  node  is  constrained.  The  choice 
p  =  x  is  just  slightly  less  stringent  than  the  absolute  overflow 
constraint  (7),  and  prevents  overflow  even  when  the  input  is 
a  narrow-band  signal  at  resonance  of  the  relevant  transfer 
function.  The  p  =  2  constraint  is  more  appropriate  for  prevent¬ 
ing  overflow  when  the  input  is  vvi  'e-band  in  nature.  Two  L, 
norms  on  the  output  noise  spectrum  are  of  particular  interest: 
the  r=l  norm  which  measures  the  total  output  noise  power, 
and  the  r==o  norm  which  measures  the  peak  value  of  the 
spectrum  of  the  output  noise. 

With  regard  to  selection  of  1 D  or  2D  forms,  Jackson's  rule 
of  thumb  says  to  select  1 1)  when  p  =  2,  r=oc  and  2D  when 
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p=  00,  r=  1;  for  p-  2,  r=  \  and  for  />=*  ,  /•=  x  ,  either  form 
may  lit-  selected. 

I  ho  choice  of  ordering  of  sections  also  depends  on  the 
norms  wliirli  are  selected,  h or  p  =  2,  r  =  x  ,  the  sections  should 
he  ordered  in  dec reusing  peakedness,  where  peakedness  is 
debited  as  the  peak  gain  of  a  section  divided  In  its  rms  gain, 
hor  p—  ■x  ,  r=  1,  the  sections  should  he  ordered  in  increasing 
pcakedness.  hor  p  =  2,  r=  1  ami  for  />  =  x,  r  =  r.  ,  the  choice 
of  ordering  depends  on  w  hether  form  2/7  or  1/7  is  chosen  for 
the  individual  sections.  Decreasing  peakedness  should  lie 
chosen  with  form  2/7,  and  increasing  jieakcdncss  with  form 
1/7 

I  he  rule  for  pairing  of  poles  and  zeros  is  as  follows:  l.et 
//»(“)  <lenote  the  magnitude  of  the  frequency  response  of 
the  nth  section,  and  .1/,,  denote  the  maximum  over  at  of 
//»(") I  -  T,u'n  the  pairing  should  he  chosen  such  that  the 
maximum  over  n  of  .!/„  is  minimized. 

The  above  rules  are  illustrated  by  Jackson  with  a  specific 
filter  example  a  sixth-order  (  hehyshev  band  rejection  filter, 
and  the  results  are  in  accord  with  his  rules.  He  analyzes  the 
output  noise  of  this  filter  for  parallel  forms  1 P  and  IP  and  for 
all  orderings  of  cascade  forms  1/7  and  2/7  (with  proper  pole- 
zero  pairing).  Little  difference  is  seen  between  the  two  parallel 
forms,  hor  p=  2,  r=  x  the  peak  output  noise  spectrum  is  7-12 
dll  worse  for  the  2/7  cascade  forms  than  for  the  1/7  forms; 
while  for  p=^-,  r=l,  the  output  noise  power  is  7  12  dll 
worse  for  1/7  than  for  2/7.  The  effects  of  pole-zero  pairing  and 
of  ordering  of  sections  also  follow  quite  well  the  rules  previ¬ 
ously  stated.  The  parallel  forms  turn  out  to  be  slightly  supe¬ 
rior  to  the  best  cascade  forms  with  respect  to  roundoff  noise. 

As  Jackson  indicates,  these  rules  of  thumb  haw  certain 
qualifications  and  are  not  always  valid.  However,  they  have 
been  shown  to  be  helpful  in  a  variety  of  types  of  examples  |.1|  ]. 

C.  Choice  of  Form  for  Floating-Point  Digital  Filter 

H>  means  of  an  example,  l.iu  and  Kaneko  [8]  have  com¬ 
pared  the  direct,  cascade,  and  parallel  f<  rill  realizations  of  a 
floating-point  digital  filter.  The  filter  selected  was  an  eighth- 
order  low-pass  elliptic  filter.  The  noise-to-signal  ratio  for  the 
parallel  form  was  about  20  <11$  worse  than  for  the  direct  form, 
while  the  cascade  form  was  comparable  to  (about  1.5  <11$ 
worse  than)  the  parallel  form. 

Various  ordering*  of  cascade  form  floating-point  filters 
have  not  been  studied  in  detail.  Probably  floating-point 
cascade  filters  are  not  too  sensitive  to  ordering  since  large 
variations  in  signal  level  from  stage  to  stage  can  be  accom¬ 
modated  by  the  floating-point  exponent. 

A  comparison  of  the  noise-to-signal  ratio  properties  of 
floating-point  second-order  sections  where  poles  precede 
zeros  (Fig.  16)  and  where  zeros  precede  poles  (Fig.  17)  indi¬ 
cates  that  at  least  for  white-noise  inputs  the  behavior  of  the 
two  forms  is  essentially  identical  Fora  high-gain  second-order 
section  of  low  resonant  frequency,  a  coupled  form  realization 
yields  some  noise-to-signal  ratio  advantage  over  both  of  these 
two  forms. 

D.  FFT  Filter 

The  results  of  our  roundoff  noise  analysis  for  fixed-point 
LFT  will  now  be  applied  to  obtain  an  expression  for  the  out¬ 
put  noise-to-signal  ratio  of  a  finite  impulse  response  digital 
filter,  implemented  by  means  of  the  FFT.  The  overflow  con¬ 
straints  of  this  type  of  filter  will  he  accounted  for  in  the  analy¬ 


sts.  Attention  will  be  focussed  on  a  prototype  low  pass  filter 
with  256-point  impulse  response  and  a  cutoff  frequency  of  1  4 
the  half  sampling  frequency.  Rounded  arithmeth  will  be 
assumed. 

I  .et  us  examine  t  lit  basic  steps  in  the  filtering  computation, 
tracing  the  buildup  of  noise  variance  as  we  proceed  I  irst  the 
b  I  T  is  used  to  compute  the  I) FT  of  a  section  of  input  In  the 
implementation  of  a  filter  with  256  point  impulse  response,  it 
is  reasonable  to  compute  a  512-point  FFT,  where  the  input 
consists  of  256  dita  samples  and  256  zeros.  Actually  512  real 
input  samples  would  be  treated  simultaneously,  bv  placing 
sections  of  256  real  samples  in  both  the  real  and  imaginary 
parts  of  the  input  to  the  111.  To  guarantee  against  overflow, 
a  scaling  of  1  2  is  needed  at  each  stage  of  the  FFT  yielding 
an  overall  attenuation  of  1  512.  Tile  samples  of  the  input 
sequence  must  be  less  than  unity  in  magnitude.  The  noise 
/-i(/)  at  the  out  pul  jf  this  first  Df  L  has  variance 

=  /•;,(*)  |5  =  -  2~s\  (92) 

This  noise  variance  is  small,  because  most  of  the  roundoff 
noise  has  been  shifted  off  by  the  attenuations.  However,  the 
scalings  have  also  caused  the  mean -squared  signal  to  decrease 
by  a  factor  of  1  512. 

Next  this  computed  transform  is  multiplied  by  a  sequence 
//(/•)  representing  the  DI  P  of  a  512-point  sequence  /?(;;)  con¬ 
sisting  of  the  filter  impulse  resoonse  plus  256  zeros.  This  com¬ 
plex  multiplication  introduces  roundoff  noise  of  variance 
2  *  .$.  Assuming  that  we  have  chosen  |  //(it) |  <1,  the  mean 
square  of  the  noise  i.\  becomes  reduced  bv 

1  iU 

■  n£  H(k)"-=n  (95) 

a  ratio  of  the  filter  bandwidth  to  the  sampling  frequency. 
Thus  after  the  multiplication.,  the  variance  of  the  total  noise 
/-•••(*)  is 

2  «  5 

?  +  />  -  2_s\  (94) 

I  his  noise  is  not  white,  but  has  a  component  whose  spectrum 
has  been  shaped  by  the  filter. 

For  the  example  under  consideration  /$  =  1  4,  so 

one  «  ;  2  (95) 

4 

Note  that  a/.;,-  is  slightly  less  than  <r 1  and  represents  only 
about  a  bit  of  noise.  However,  if  the  signal  spectrum  is  flat, 
the  mean-squared  signal  will  also  be  reduced  somewhat  due 
to  the  multiplication  by  //(/’). 

Now  an  inverse  transform  is  computed  to  obtain  a  section 
of  output.  The  noise  variance  at  the  output  of  this  transform 
depends  on  how  many  scalings  are  necessary  in  the  inverse 
111.  In  order  to  determine  how  many  scalings  are  necessary, 
a  bound  on  the  output  of  the  circular  convolution  is  required 
[.$2],  Lor  a  particular  (liter,  such  a  bound  can  lie  stated  as 

;  v(«)  <  Z  HI)  |  (96) 

/  ■(> 

where  v(«)  is  the  output  and  .1/  is  the  length  of  the  impulse 
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response  I'hc  prototvpe  filter  lias  an  impulse  response 
2  n  si 


*(«)  = 


256 


1  “  Ixkn 

+  E  (-i)'ios  — - 

.2  (-..I)  2.i6 


2tr(52)«  .  2ir  (88);/ 

+  0.7  cos  —  0.225  cos - 

256  256 


+  0.01995  cos 


2+88)//" 

256 


and 


(97) 


E  I  *(«) !  =  -U2. 


(98) 


Hence,  only  two  scalings  (at  the  first  two  arrays)  are  neces¬ 
sary  in  the  inverse  transform.  Then,  in  propagating  through 
the  IFFT  (inverse  FFT),  the  variance  of  the  noise  E>(k) 
increases  by  a  factor  of  512/16.  (The  512  represents  the  gain 
of  the  inverse  DFT,  and  the  1/16  is  due  to  the  scalings.) 
The  variance  of  the  additional  output  noise  /-+  k)  caused  by 
roundoff  in  the  IFFT  can  lie  estimated  easily  via  the  method 
of  Section  I V-C  The  result  is 

/512  512\  " 

OF*  =  <r,rJ( - + - )  +  cr/i2  £  2*  =  (202)2-26.  (99) 

\  8  4  /  jui 

The  total  mean-siptared  output  noise  is 

512 

<rf:-  =  —  ofJ  +  <rt:*  =  (226)2"2b  (100) 

16 


or  in  units  of  bits  of  rms  output  noise 

J  log-.  (2 JW)  =  8.91  bits.  (101) 

The  niean-scpiared  output  signal  can  be  estimated  if 
specific  statistics  are  assumed  for  the  input  signal  .v(«).  As  an 
example,  assume  that  ,v(»)  is  white  with  variance  a/ =  2  8. 
This  variance  goes  through  an  attenuation  of  1  512  in  the 
lirst  I‘  FT,  an  attenuation  of  li  =  1  4  due  to  multiplication  by 
//(£),  and  a  gain  of  512  16  in  the  inverse  transform.  The 
mean-squared  output  signal  is  then 


and  the  output  noise-to-signal  ratio  is 

(T  ~ 

—  «  (22  000)2  ■b.  (108) 

Assuming  an  input  noise-to-signal  ratio  (due  to  A  I)  noise) 
of  (1  4)2"*,  the  noise-lo-signal  ratio  has  worsened  by  a  factor 
of  about  5500,  or 

’>  log.,  5500  =  6.15  bits  (1(14) 

in  passing  through  the  FFT  filter. 
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